Datadog Database Monitoring: PostgreSQL, MySQL, and Aurora Setup
Datadog Database Monitoring is not just metrics collection with a nicer UI — it ships query-level explain plans, wait event breakdown, and connection pool visibility without requiring pg_stat_statements configuration or custom PromQL recording rules. The mistake is enabling it and leaving all sampling and explain plan collection at defaults, which produces query data that is too sparse to diagnose production slowdowns.
Situation
Teams running Datadog for application performance monitoring have a strong reason to use it for database monitoring too: one dashboard, one query language, and automatic correlation between slow application traces and the database queries those traces hit. The alternative — running a separate Prometheus stack with postgres_exporter, custom recording rules, and Grafana — is operationally heavier for teams that are not already Prometheus-native.
Datadog Database Monitoring (DBM) covers PostgreSQL, MySQL, Aurora PostgreSQL, Aurora MySQL, SQL Server, and Oracle. This post focuses on PostgreSQL and MySQL/Aurora MySQL — the two most common open-source targets.
The challenge is not installation. The challenge is that defaults produce incomplete data: explain plans are sampled at a low rate, wait event tracking requires explicit enabling, and the Agent needs database-side configuration (a dedicated monitoring user with the right grants) that Datadog’s quickstart guide underspecifies.
Symptoms
| Symptom in Datadog DBM | Likely cause |
|---|---|
| Query samples show “no explain plan available” | pg_stat_statements not in shared_preload_libraries, or explain plan sampling rate is too low |
| Slow query visible in APM but not in DBM | Query duration is below DBM’s configured min duration threshold |
| Wait events show only “ClientRead” | track_activity_query_size too small; truncating queries before DBM can match them |
| Aurora read replicas not appearing in DBM | Agent not configured to connect to the reader endpoint separately |
| High DBM Agent CPU on the database host | Explain plan collection running too frequently; throttle via explain_statement_min_duration |
Connection count in DBM does not match pg_stat_activity | DBM is reading from pg_stat_activity but the monitoring user lacks pg_monitor role |
First Five Checks
1. Is the monitoring user configured with the right grants?
For PostgreSQL:
CREATE USER datadog WITH password 'use-secret-manager-here';
GRANT pg_monitor TO datadog;
-- Required for query samples and explain plans:
CREATE SCHEMA datadog;
GRANT USAGE ON SCHEMA datadog TO datadog;
GRANT USAGE ON SCHEMA public TO datadog;
GRANT pg_read_all_stats TO datadog;
-- Function required for DBM explain plan collection:
CREATE OR REPLACE FUNCTION datadog.explain_statement(
l_query TEXT,
OUT explain JSON
)
RETURNS SETOF JSON AS $$
DECLARE
curs REFCURSOR;
plan JSON;
BEGIN
OPEN curs FOR EXECUTE pg_catalog.concat('EXPLAIN (FORMAT JSON) ', l_query);
FETCH curs INTO plan;
CLOSE curs;
RETURN QUERY SELECT plan;
END;
$$
LANGUAGE 'plpgsql'
RETURNS NULL ON NULL INPUT
SECURITY DEFINER;
The SECURITY DEFINER function is required because DBM collects explain plans for queries run by other users — the monitoring role does not have execution rights on arbitrary user queries.
For MySQL/Aurora MySQL:
CREATE USER 'datadog'@'%' IDENTIFIED WITH mysql_native_password BY 'use-secret-manager-here';
GRANT REPLICATION CLIENT ON *.* TO 'datadog'@'%';
GRANT PROCESS ON *.* TO 'datadog'@'%';
GRANT SELECT ON performance_schema.* TO 'datadog'@'%';
-- For explain plan collection:
GRANT SELECT ON sys.* TO 'datadog'@'%';
2. Is pg_stat_statements enabled?
SHOW shared_preload_libraries;
-- Must include 'pg_stat_statements'
-- If missing, add to postgresql.conf and restart:
-- shared_preload_libraries = 'pg_stat_statements'
-- After restart, verify:
SELECT * FROM pg_extension WHERE extname = 'pg_stat_statements';
-- If absent: CREATE EXTENSION pg_stat_statements;
-- Tune:
ALTER SYSTEM SET pg_stat_statements.max = 10000;
ALTER SYSTEM SET pg_stat_statements.track = 'all';
ALTER SYSTEM SET track_activity_query_size = 4096;
SELECT pg_reload_conf();
track_activity_query_size defaults to 1024 bytes in PostgreSQL 13 and earlier. Queries longer than this are truncated in pg_stat_activity, which prevents DBM from matching query samples to their explain plans.
3. Is the Datadog Agent configured for DBM?
In /etc/datadog-agent/conf.d/postgres.d/conf.yaml:
init_config:
instances:
- host: your-db-host
port: 5432
username: datadog
password: ENC[your-secret] # use Datadog secret management
dbname: your_database
# Enable Database Monitoring:
dbm: true
# Query metrics — increase statement cache:
query_metrics:
enabled: true
# Query samples — how often to collect explain plans:
query_samples:
enabled: true
explain_statement_min_duration: 500 # ms — only collect plans for queries over 500ms
samples_per_second: 1 # Reduce if CPU pressure on the Agent host
# Wait events (PostgreSQL 9.6+):
query_activity:
enabled: true
collection_interval: 10 # seconds
tags:
- env:production
- service:your-app
- db_engine:postgres
For MySQL:
instances:
- host: your-mysql-host
user: datadog
pass: ENC[your-secret]
port: 3306
dbm: true
query_metrics:
enabled: true
query_samples:
enabled: true
explain_statement_min_duration: 500
query_activity:
enabled: true
4. Are explain plans being collected?
In Datadog UI: APM → Database Monitoring → Query Samples. Filter to your database host. If queries show “no explain plan,” verify:
- The
datadog.explain_statementfunction exists in the target database explain_statement_min_durationis not set too high (default 5000ms misses most slow OLTP queries — set to 500ms)- The query is not a DDL or
COPYstatement (explain plans are not collected for these) - The Agent’s
datadoguser hasUSAGEon the schema where the queried tables live
5. Are wait events visible?
In Datadog UI: Database Monitoring → Query Metrics → click a query → Wait Events tab. If the tab is empty:
- Verify
query_activity.enabled: trueinconf.yaml - Verify the
datadoguser haspg_monitorrole - Check Agent logs:
datadog-agent check postgres— look for errors on thepg_stat_activitycollection
Decision Tree
flowchart TD
A[Set up Datadog DBM] --> B[Create monitoring user with correct grants]
B --> C{PostgreSQL or MySQL?}
C -->|PostgreSQL| D[Enable pg_stat_statements — add to shared_preload_libraries]
C -->|MySQL| E[Grant SELECT on performance_schema and sys]
D --> F[Create datadog.explain_statement SECURITY DEFINER function]
E --> G[Set dbm:true in Agent conf.yaml]
F --> G
G --> H[Set explain_statement_min_duration to 500ms]
H --> I[Enable query_activity for wait events]
I --> J{Verify data appears}
J -->|Query samples empty| K[Check pg_stat_statements.track — set to all — check track_activity_query_size]
J -->|No explain plans| L[Verify explain_statement function — check USAGE grant on all schemas]
J -->|No wait events| M[Verify pg_monitor grant — check query_activity.enabled in conf.yaml]
J -->|All data visible| N[Set alert thresholds on p99 query latency and connection saturation]
Rollback Plan
If DBM is causing database load:
- Reduce
query_samples.samples_per_secondto0.1or disable query sampling entirely:query_samples.enabled: false. Query metrics (without explain plans) have minimal database impact. - Increase
explain_statement_min_durationto 2000ms to reduce explain plan frequency. - If the monitoring connection itself is causing connection count pressure, reduce Agent check frequency:
min_collection_interval: 30(seconds). - Disable
query_activitycollection if thepg_stat_activityquery is slow on instances with many databases or connections. - The
datadog.explain_statementfunction runsEXPLAINon sampled queries. On very high-throughput databases, this adds measurable load. Disable plan collection and rely on query metrics only if the database is already under pressure.
Automation Opportunity
-
Provision monitoring user via Terraform: manage the
datadogPostgreSQL user and grants through the same Terraform module that provisions the database. Store the password in AWS Secrets Manager or Vault, not in the Agent config file directly. -
Agent configuration as code: manage
conf.yamlthrough Ansible or a Helm chart value. Theexplain_statement_min_durationthreshold andcollection_intervalsettings should be tunable per environment without touching the Agent host directly. -
Alert from DBM metrics: create Datadog monitors on:
postgresql.connections> 80% ofmax_connections— warning; 90% criticalpostgresql.replication.delay> 60s warning; 300s criticalpostgresql.queries.avg_timeP99 spike > 2× baseline — warningmysql.replication.seconds_behind_master> 30s warning; null = critical (broken replication)
Leadership Summary
Datadog Database Monitoring closes the gap between APM traces and database behavior. When an application trace is slow, DBM lets the team click through to the specific SQL, its explain plan at the time of the slowdown, and the wait events that show what the database was waiting on. Without DBM configured correctly — with the right grants, pg_stat_statements enabled, track_activity_query_size large enough, and explain plan sampling at a useful threshold — the team gets query metrics but not query diagnostics. The setup work is one-time; the operational benefit is continuous.
Where It Breaks
| Failure mode | Trigger | Fix |
|---|---|---|
| Explain plans absent for short queries | explain_statement_min_duration set to 5000ms (default) | Lower to 500ms for OLTP databases |
| Truncated queries in DBM | track_activity_query_size too small | Set to 4096 in postgresql.conf |
| Aurora read replicas not in DBM | Each endpoint is a separate instance | Add a separate instances: entry for the reader endpoint in conf.yaml |
SECURITY DEFINER function security concern | Function runs EXPLAIN as superuser equivalent | Limit the function to read-only plans only — the function only calls EXPLAIN, not EXECUTE |
| DBM adds one extra connection per Agent | On databases near max_connections, Agent connection pushes over the limit | Reserve connections for monitoring: set max_connections 10 higher than application pool max |
pg_stat_statements reset on restart | Cumulative counters reset; DBM shows spike | Set pg_stat_statements.save = on; use rate metrics in Datadog, not raw counters |
What to Do Next
- Problem: Your database is visible in Datadog as infrastructure metrics but slow queries are not linked to their explain plans or wait events.
- Solution: Enable DBM with the monitoring user grants above, set
explain_statement_min_durationto 500ms, and verifypg_stat_statementsis loaded. - Proof: After setup, trigger a known slow query and verify it appears in Query Samples with an explain plan attached within 60 seconds.
- Action: This week, create the
datadogmonitoring user, add theSECURITY DEFINERexplain function, and setdbm: truein the Agent config. Restart the Agent and verify query samples appear in the Datadog UI within 5 minutes.