Datadog DBM: What Database Teams Should Actually Monitor

Datadog Database Monitoring (DBM) will happily show you every query, every plan, and every host metric your fleet produces. The trap is treating “more telemetry” as “better observability.” The teams who get value from DBM monitor a short list of signals tied to decisions — and deliberately ignore the rest, because in DBM the rest is also a line on the bill.

Problem

A team turns on Datadog DBM expecting clarity and gets a firehose: thousands of normalized queries, host dashboards, plan samples, and a steadily climbing Datadog invoice. Six weeks later the on-call engineer still can’t answer “why was the database slow at 2am?” any faster than before, because the dashboards show everything and therefore foreground nothing. Meanwhile DBM is now a noticeable cost itself — host-based DBM pricing plus custom metrics plus log ingestion. Observability that you pay for but don’t act on is just a second cost problem stacked on the first.

Why it matters financially

Observability spend is real spend, and DBM has several meters running at once:

Per-host DBM scales with your fleet — every replica and non-prod instance you instrument adds cost, whether or not anyone reads its dashboard.
Custom metrics are billed per unique combination of metric name and tag values. High-cardinality tags (per-user, per-request-id) can multiply a single metric into thousands of billable timeseries.
Log ingestion and retention for slow-query and audit logs add a third meter.
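To make the custom-metrics meter concrete, here is a minimal sketch of the cardinality arithmetic. It assumes the worst case, where every combination of tag values actually occurs, so the result is an upper bound rather than a bill. The tag names and counts are hypothetical.

```python
from math import prod


def max_timeseries(tag_cardinalities: dict[str, int]) -> int:
    """Upper bound on distinct timeseries for one metric.

    Each unique combination of tag values is its own timeseries, so the
    ceiling is the product of the distinct values each tag can take.
    """
    return prod(tag_cardinalities.values())


# Hypothetical tag sets for a single custom metric.
bounded = {"env": 3, "db_cluster": 12, "query_class": 8}
unbounded = {**bounded, "request_id": 50_000}

print(max_timeseries(bounded))    # 288
print(max_timeseries(unbounded))  # 14400000
```

In practice only the combinations that are actually emitted count, so the real figure for the second case sits somewhere between the number of distinct request IDs and this ceiling. Either way, one unbounded tag dominates everything else.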

The financial point cuts both ways: under-monitoring means you can’t see the cost and reliability problems that matter (the theme of every other article in this series), while naïve monitoring means you pay to collect telemetry nobody uses. The goal is the small set of signals that actually change a decision.

Technical root causes (why DBM bills and dashboards balloon)

Instrumenting everything by default — every non-prod and idle replica gets a DBM host agent.
High-cardinality custom metrics — tagging metrics with unbounded values (user IDs, request IDs) explodes billable timeseries.
Collecting without alerting — query samples and metrics gathered but wired to no alert and no runbook.
Symptom-level alerts — “host CPU high” instead of leading indicators (replication lag, connection saturation, storage runway).
No baseline — without a normal range, dashboards can’t tell you whether 2am was abnormal.
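The last root cause, the missing baseline, is the easiest to illustrate. The sketch below derives a normal range from historical samples and checks a new reading against it. The mean plus or minus three standard deviations is an assumption chosen for brevity, not a recommendation from this article; percentiles or a seasonal model may suit real workloads better. The sample values are made up.

```python
from statistics import mean, stdev


def normal_range(history: list[float], k: float = 3.0) -> tuple[float, float]:
    """Return (low, high) as mean +/- k sample standard deviations."""
    mu = mean(history)
    sigma = stdev(history)  # requires at least two samples
    return (mu - k * sigma, mu + k * sigma)


def is_abnormal(value: float, history: list[float], k: float = 3.0) -> bool:
    """True when value falls outside the range derived from history."""
    low, high = normal_range(history, k)
    return value < low or value > high


# Hypothetical p95 query latency (ms) at 2am over the previous 14 nights.
history = [40, 42, 38, 41, 39, 43, 40, 41, 39, 42, 40, 38, 41, 40]

print(is_abnormal(95, history))  # True
print(is_abnormal(41, history))  # False
```

Comparing 2am against previous 2am readings, rather than against the daily average, keeps nightly batch jobs from looking like incidents.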

Review checklist — what DBM should be answering

Monitor signals tied to a decision. At minimum:

Top queries by total time and by I/O — the same pg_stat_statements view DBM surfaces fleet-wide; this is your cost and latency hot list.
Replication lag — with a defined normal range and a threshold alert (not just a graph).
Connection saturation — active vs max_connections, alerted before the limit.
Storage runway — free space / days-to-full, alerted with lead time.
Cache hit ratio and deadlocks/lock waits — early signals of memory pressure and contention.
Long-running / idle-in-transaction — the transactions that block vacuum and cause incidents.
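Two of the signals above, connection saturation and storage runway, reduce to simple arithmetic. The following sketch shows one way to turn them into alert conditions. The 80% saturation threshold, the 14-day minimum runway, and the linear growth assumption are all illustrative placeholders; set them from your own baseline.

```python
def connection_saturation(active: int, max_connections: int) -> float:
    """Fraction of max_connections currently in use."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active / max_connections


def days_to_full(free_gb: float, daily_growth_gb: float) -> float:
    """Storage runway in days, assuming growth stays linear."""
    if daily_growth_gb <= 0:
        return float("inf")  # not growing, so no projected fill date
    return free_gb / daily_growth_gb


def leading_indicator_alerts(
    active: int,
    max_connections: int,
    free_gb: float,
    daily_growth_gb: float,
    saturation_threshold: float = 0.8,  # hypothetical threshold
    min_runway_days: float = 14.0,      # hypothetical lead time
) -> list[str]:
    """Return a message for each indicator that crossed its threshold."""
    alerts = []
    saturation = connection_saturation(active, max_connections)
    if saturation >= saturation_threshold:
        alerts.append(f"connections at {saturation:.0%} of max_connections")
    runway = days_to_full(free_gb, daily_growth_gb)
    if runway < min_runway_days:
        alerts.append(f"storage full in about {runway:.0f} days")
    return alerts


print(leading_indicator_alerts(active=170, max_connections=200,
                               free_gb=120, daily_growth_gb=10))
```

Both checks fire before the failure itself: the first while there are still free connection slots, the second while there is still time to resize or clean up.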

And on the cost side of DBM itself:

Which hosts are instrumented — are idle replicas and non-prod paying for DBM they don’t need?
Are any custom metrics high-cardinality? Check your top metrics by timeseries count.
For every collected signal: is there an alert and a runbook? If not, why collect it?
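The last question can be run as a mechanical audit once you keep an inventory of what you collect. This is a minimal sketch under the assumption that such an inventory exists; the `Signal` type and the entries in it are hypothetical, not something DBM exports in this shape.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    name: str
    has_alert: bool
    has_runbook: bool


def unactionable(signals: list[Signal]) -> list[str]:
    """Names of signals that lack an alert, a runbook, or both."""
    return [s.name for s in signals if not (s.has_alert and s.has_runbook)]


# Hypothetical inventory of collected telemetry.
inventory = [
    Signal("replication_lag", has_alert=True, has_runbook=True),
    Signal("slow_query_logs", has_alert=False, has_runbook=False),
    Signal("query_samples", has_alert=True, has_runbook=False),
]

print(unactionable(inventory))  # ['slow_query_logs', 'query_samples']
```

Everything the function returns is a candidate for one of two fixes: wire it to an alert and a runbook, or stop collecting it.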

Example findings

(Illustrative — the patterns these reviews repeatedly surface.)

DBM was enabled on every host including 6 idle non-prod replicas; scoping DBM to production and active readers cut DBM host cost without losing a single useful dashboard.
A custom metric tagged with request_id had ballooned into tens of thousands of billable timeseries; dropping the unbounded tag collapsed it to a handful.
The team had rich query dashboards but no alert on replication lag — the one signal that would have warned them before a read-after-write incident.
Slow-query logs were ingested and retained for 30 days but never queried; trimming retention cut log cost with no operational loss.

Actions to take

Define the decision for every signal. If a metric or log maps to no alert and no runbook, stop paying to collect it, or sample it at a lower rate.
Scope DBM to what you act on. Production and active replicas first; instrument non-prod only when you’re actively debugging it.
Kill high-cardinality tags. Audit top custom metrics by timeseries count; remove unbounded tag values.
Alert on leading indicators, not symptoms. Replication lag, connection saturation, storage runway, long-running transactions — each with a threshold and an owner.
Establish a baseline so that “is this abnormal?” can be answered from data.
Re-check DBM’s own cost as a line item — observability is worth paying for; paying for noise is not.
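The scoping action above can be written down as an explicit rule, which makes it reviewable instead of tribal. The sketch below is one possible encoding. The `Host` fields and the example fleet are hypothetical, and how you determine that a replica is actively serving reads is left to your own metrics.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    env: str              # e.g. "prod" or "staging"
    serving_traffic: bool  # primary, or a replica that serves reads
    debugging: bool = False  # set while someone is investigating it


def should_instrument(host: Host) -> bool:
    """Production hosts that serve traffic, plus anything being debugged."""
    if host.debugging:
        return True
    return host.env == "prod" and host.serving_traffic


# Hypothetical fleet.
fleet = [
    Host("pg-primary", "prod", serving_traffic=True),
    Host("pg-read-1", "prod", serving_traffic=True),
    Host("pg-standby-idle", "prod", serving_traffic=False),
    Host("pg-staging", "staging", serving_traffic=True),
    Host("pg-dev", "dev", serving_traffic=False, debugging=True),
]

print([h.name for h in fleet if should_instrument(h)])
# ['pg-primary', 'pg-read-1', 'pg-dev']
```

The `debugging` flag is the escape hatch: non-prod hosts get instrumented on demand and drop off again when the flag is cleared.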

Good database observability and a controlled observability bill are the same discipline as the rest of cost engineering: collect what answers a question, alert on what you’ll act on, and measure the cost of the tooling itself.

Review checklist & next step

Use the free 30-Point Database Cost Review Checklist — its Observability section maps directly to the signals above. To see how observability gaps show up in a full review, read the Acme SaaS sample report.

Want your monitoring assessed against the questions that matter? AKS runs a Database Observability Review — what to collect, what to alert on, and what you’re paying to gather but never use. Or get in touch to scope a pilot.

Rajiv
