Database Alert Design: Thresholds That Fire on Real Problems

Most database alert fatigue comes from thresholds set to catch anything unusual rather than thresholds calibrated to actual user impact. An alert that fires on every autovacuum run, every checkpoint, and every 5-second replica lag spike will be silenced by engineers within a week — and then the real incidents will go unnoticed.

Situation

Database teams accumulate alerts in one of two ways: copy default thresholds from the monitoring tool’s out-of-box configuration, or set thresholds after an incident when the previous absence of an alert was painful. Both approaches produce the wrong result.

Default thresholds are calibrated for visibility, not signal quality. They generate enough noise that teams learn to ignore them. Incident-driven thresholds overfit to a specific failure pattern and miss adjacent ones.

The right design is a two-level alert architecture: a warning level that gives the team early signal and time to investigate, and a critical level that triggers paging because user impact is already occurring or imminent.

Symptoms

Symptom in the alert system	What it usually means
Alert fired, no incident found	Threshold is at wrong level or condition is transient and self-resolving
Alert fired after users already complained	Threshold is too high or measurement resolution is too low
Same alert fires daily at the same time	Normal batch job or backup window — suppress or add time-based exclusion
Alert never fires in production	Either system is very healthy, or threshold is too permissive
Multiple alerts fire at once for the same root cause	Missing alert correlation — downstream symptoms of a single root cause

First Five Checks

Before setting any threshold, measure the baseline over 7 days on the production workload.

1. What is the normal replica lag distribution?

Collect replay_lag from pg_stat_replication (PostgreSQL) or Seconds_Behind_Master (MySQL) every 60 seconds for 7 days. Identify:

Median lag during business hours
95th percentile lag during peak write periods
Maximum lag during known batch jobs or backups

Set the warning threshold at 2× the 95th percentile peak. Set the critical threshold at the point where read replicas return data more than one commit cycle stale for your application’s consistency requirements — typically 60–120 seconds for OLTP, 5–15 minutes for analytics.

2. What is the normal connection utilization pattern?

-- PostgreSQL: connections used vs max
SELECT count(*) AS active,
       (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max_conn,
       ROUND(count(*) * 100.0 /
             (SELECT setting::int FROM pg_settings WHERE name = 'max_connections'), 1) AS pct_used
FROM pg_stat_activity;

Measure this every minute over 7 days. Alert at 70% (warning — time to investigate pool settings) and 85% (critical — application will soon see connection errors).

3. What does checkpoint behavior look like during normal operations?

From pg_stat_bgwriter, collect checkpoints_req over time. Zero is ideal — all checkpoints should be checkpoints_timed. Any non-zero checkpoints_req over a 5-minute period means write pressure is forcing early checkpoints. Alert when checkpoints_req > 0 for more than 3 consecutive minutes.

4. What is the slow query baseline?

Enable pg_stat_statements and measure the 95th percentile query duration for your top 20 query types over 7 days. Use this to set application-specific slow query thresholds — not a global “any query over 1 second” rule, which fires on legitimate analytical queries.

5. What does disk growth look like?

Measure database disk usage daily for 30 days and compute the trend. Alert when the projected exhaustion date (at the current growth rate) falls within 14 days. This is a warning. A critical alert triggers when the projected exhaustion falls within 3 days or when a sudden disk spike exceeds the 30-day average growth by 5×.

Decision Tree

flowchart TD
    A[Alert fires] --> B{User impact?}
    B -->|Users already reporting issues| C[Critical — escalate to on-call]
    B -->|No user reports| D{Trending toward impact?}
    D -->|Yes — within SLO window| E[Warning — investigate now]
    D -->|No — transient spike| F{Is this a known pattern?}
    F -->|Yes — batch job, backup, maintenance| G[Suppress for this window — add schedule exclusion]
    F -->|No — unexpected| H[Investigate root cause — check pg_stat_activity and slow query log]
    H --> I{Root cause identified?}
    I -->|Yes| J[Fix or tune threshold — document the baseline]
    I -->|No| K[Escalate with evidence package — query plans, metrics window, server log]

Alert Thresholds Reference

PostgreSQL

Metric	Warning	Critical	Notes
Replica lag	60s	300s	Use `replay_lag`; adjust for batch job windows
Connection utilization	70% of `max_connections`	85%	Count only non-idle sessions for more accurate signal
`checkpoints_req`	> 0 for 3 min	> 0 for 10 min	Any forced checkpoint means write pressure
Dead tuple ratio	20% on tables > 100k rows	40%	Per-table alert, not global
Cache hit ratio	< 97%	< 90%	Monitor `pg_statio_user_tables` hits vs reads
Table bloat (relation size growth)	2× expected	3× expected	Compare against 30-day baseline
Long-running query	> 60s	> 300s	OLTP threshold; analytical systems need separate policy
Idle-in-transaction session	> 5 min	> 15 min	Per-session duration, not aggregate count
`pg_stat_replication` slot lag	100 MB	1 GB	Unused replication slots block WAL cleanup

MySQL / Aurora MySQL

Metric	Warning	Critical	Notes
`Seconds_Behind_Master`	30s	120s	Use Aurora replica lag metric in CloudWatch for Aurora
`Threads_connected`	70% of `max_connections`	85%	`Threads_running` spike is the lead indicator
`Innodb_buffer_pool_wait_free`	> 0 per 5 min	> 100 per 5 min	Buffer pool pages not available — memory pressure
`Innodb_log_waits`	> 0 per 5 min	> 10 per 5 min	Redo log full — write throughput exceeded
Slow query rate	2× 7-day average	5× 7-day average	Rate, not absolute count
`Open_tables`	80% of `table_open_cache`	95%	Too-small cache causes repeated table opens
Lock wait timeout	> 5 per minute	> 20 per minute	High contention — check for hot rows or large transactions

Aurora PostgreSQL / Aurora MySQL (CloudWatch-specific)

CloudWatch metric	Warning	Critical	Notes
`ReplicaLag`	30s	120s	Distinct from standard PostgreSQL; checked via CloudWatch
`DatabaseConnections`	70% of instance max	85%	Per-instance limit, check RDS parameter group
`FreeStorageSpace`	< 20 GB or < 20%	< 5 GB	Aurora storage auto-scales but billing changes
`AuroraVolumeBytesLeftTotal`	< 10 TB	< 1 TB	Aurora 128 TB storage ceiling
`WriteIOPS`	2× 7-day P95	5× 7-day P95	Sudden IOPS spike — check for bulk loads
`EngineUptime`	—	Unexpected reset	Unexpected restart — check for OOM or crash

Rollback Plan

If a threshold change causes alert fatigue or misses a real incident:

Revert to the previous threshold immediately and document the direction of failure (too sensitive vs. too permissive).
Collect a 7-day baseline at the previous threshold before making another change.
For critical alerts, always test in staging with a simulated failure scenario before applying to production.
Keep a changelog of threshold changes with the justification and the measurement that motivated each change.

Automation Opportunity

Alert routing automation that reduces toil:

Batch job suppression: automatically suppress replica lag alerts during known ETL windows (e.g., 01:00–04:00 UTC) and backup windows. Log the suppression, do not silently drop.
Alert correlation: when connection exhaustion and slow query alerts fire within 5 minutes of each other, group them into a single incident with both signals attached. The root cause is almost always the same event.
Baseline drift detection: weekly job that checks whether current metric values have permanently shifted from the thresholds set 30 days ago. If p95 is consistently higher than the warning threshold, the baseline has shifted — either the system is degrading or the workload grew.

Leadership Summary

Database alert reliability is a trust problem as much as a technical one. Teams stop responding to alerts that have false-positive rates above 20%. The two-level architecture (warning = investigate, critical = page) with calibrated per-metric thresholds keeps signal quality high enough that critical alerts are taken seriously. The measurement-first approach — setting thresholds from 7-day baselines rather than intuition — produces thresholds that reflect actual system behavior, not guesses.

Where It Breaks

Failure mode	Trigger	Fix
Threshold set without baseline	Alert fires on normal workload variation	Measure 7-day baseline before setting any threshold
Global slow query threshold	Legitimate analytics queries fire alert constantly	Per-query-class thresholds or separate analytics monitoring policy
Alert on every autovacuum	autovacuum is working correctly but noisy	Alert on dead tuple ratio, not autovacuum event frequency
Missing maintenance window suppression	Backup and ETL jobs generate false positives every night	Add time-of-day or scheduled suppressions with logging
No test for false negatives	Team knows when alerts fire too much, but not when they miss	Simulate failure scenarios in staging quarterly to verify alert coverage

What to Do Next

Problem: Your database alerts either fire too often (ignored) or too late (users complain first).
Solution: Measure 7-day baselines for the five metric groups above, then set two-level thresholds (warning, critical) calibrated to those baselines.
Proof: Replay the last three database incidents against the proposed thresholds and verify they would have alerted at the warning level before user impact.
Action: This week, pull 7 days of replica lag, connection utilization, and slow query data from your monitoring tool and set the two-level thresholds using the reference values above as a starting point.