Database Alert Design: Thresholds That Fire on Real Problems
Most database alert fatigue comes from thresholds set to catch anything unusual rather than thresholds calibrated to actual user impact. An alert that fires on every autovacuum run, every checkpoint, and every 5-second replica lag spike will be silenced by engineers within a week — and then the real incidents will go unnoticed.
Situation
Database teams accumulate alerts in one of two ways: copy default thresholds from the monitoring tool’s out-of-box configuration, or set thresholds after an incident when the previous absence of an alert was painful. Both approaches produce the wrong result.
Default thresholds are calibrated for visibility, not signal quality. They generate enough noise that teams learn to ignore them. Incident-driven thresholds overfit to a specific failure pattern and miss adjacent ones.
The right design is a two-level alert architecture: a warning level that gives the team early signal and time to investigate, and a critical level that triggers paging because user impact is already occurring or imminent.
Symptoms
| Symptom in the alert system | What it usually means |
|---|---|
| Alert fired, no incident found | Threshold is at wrong level or condition is transient and self-resolving |
| Alert fired after users already complained | Threshold is too high or measurement resolution is too low |
| Same alert fires daily at the same time | Normal batch job or backup window — suppress or add time-based exclusion |
| Alert never fires in production | Either system is very healthy, or threshold is too permissive |
| Multiple alerts fire at once for the same root cause | Missing alert correlation — downstream symptoms of a single root cause |
First Five Checks
Before setting any threshold, measure the baseline over 7 days on the production workload.
1. What is the normal replica lag distribution?
Collect replay_lag from pg_stat_replication (PostgreSQL) or Seconds_Behind_Master (MySQL) every 60 seconds for 7 days. Identify:
- Median lag during business hours
- 95th percentile lag during peak write periods
- Maximum lag during known batch jobs or backups
Set the warning threshold at 2× the 95th percentile peak. Set the critical threshold at the point where read replicas return data more than one commit cycle stale for your application’s consistency requirements — typically 60–120 seconds for OLTP, 5–15 minutes for analytics.
2. What is the normal connection utilization pattern?
-- PostgreSQL: connections used vs max
SELECT count(*) AS active,
(SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max_conn,
ROUND(count(*) * 100.0 /
(SELECT setting::int FROM pg_settings WHERE name = 'max_connections'), 1) AS pct_used
FROM pg_stat_activity;
Measure this every minute over 7 days. Alert at 70% (warning — time to investigate pool settings) and 85% (critical — application will soon see connection errors).
3. What does checkpoint behavior look like during normal operations?
From pg_stat_bgwriter, collect checkpoints_req over time. Zero is ideal — all checkpoints should be checkpoints_timed. Any non-zero checkpoints_req over a 5-minute period means write pressure is forcing early checkpoints. Alert when checkpoints_req > 0 for more than 3 consecutive minutes.
4. What is the slow query baseline?
Enable pg_stat_statements and measure the 95th percentile query duration for your top 20 query types over 7 days. Use this to set application-specific slow query thresholds — not a global “any query over 1 second” rule, which fires on legitimate analytical queries.
5. What does disk growth look like?
Measure database disk usage daily for 30 days and compute the trend. Alert when the projected exhaustion date (at the current growth rate) falls within 14 days. This is a warning. A critical alert triggers when the projected exhaustion falls within 3 days or when a sudden disk spike exceeds the 30-day average growth by 5×.
Decision Tree
flowchart TD
A[Alert fires] --> B{User impact?}
B -->|Users already reporting issues| C[Critical — escalate to on-call]
B -->|No user reports| D{Trending toward impact?}
D -->|Yes — within SLO window| E[Warning — investigate now]
D -->|No — transient spike| F{Is this a known pattern?}
F -->|Yes — batch job, backup, maintenance| G[Suppress for this window — add schedule exclusion]
F -->|No — unexpected| H[Investigate root cause — check pg_stat_activity and slow query log]
H --> I{Root cause identified?}
I -->|Yes| J[Fix or tune threshold — document the baseline]
I -->|No| K[Escalate with evidence package — query plans, metrics window, server log]
Alert Thresholds Reference
PostgreSQL
| Metric | Warning | Critical | Notes |
|---|---|---|---|
| Replica lag | 60s | 300s | Use replay_lag; adjust for batch job windows |
| Connection utilization | 70% of max_connections | 85% | Count only non-idle sessions for more accurate signal |
checkpoints_req | > 0 for 3 min | > 0 for 10 min | Any forced checkpoint means write pressure |
| Dead tuple ratio | 20% on tables > 100k rows | 40% | Per-table alert, not global |
| Cache hit ratio | < 97% | < 90% | Monitor pg_statio_user_tables hits vs reads |
| Table bloat (relation size growth) | 2× expected | 3× expected | Compare against 30-day baseline |
| Long-running query | > 60s | > 300s | OLTP threshold; analytical systems need separate policy |
| Idle-in-transaction session | > 5 min | > 15 min | Per-session duration, not aggregate count |
pg_stat_replication slot lag | 100 MB | 1 GB | Unused replication slots block WAL cleanup |
MySQL / Aurora MySQL
| Metric | Warning | Critical | Notes |
|---|---|---|---|
Seconds_Behind_Master | 30s | 120s | Use Aurora replica lag metric in CloudWatch for Aurora |
Threads_connected | 70% of max_connections | 85% | Threads_running spike is the lead indicator |
Innodb_buffer_pool_wait_free | > 0 per 5 min | > 100 per 5 min | Buffer pool pages not available — memory pressure |
Innodb_log_waits | > 0 per 5 min | > 10 per 5 min | Redo log full — write throughput exceeded |
| Slow query rate | 2× 7-day average | 5× 7-day average | Rate, not absolute count |
Open_tables | 80% of table_open_cache | 95% | Too-small cache causes repeated table opens |
| Lock wait timeout | > 5 per minute | > 20 per minute | High contention — check for hot rows or large transactions |
Aurora PostgreSQL / Aurora MySQL (CloudWatch-specific)
| CloudWatch metric | Warning | Critical | Notes |
|---|---|---|---|
ReplicaLag | 30s | 120s | Distinct from standard PostgreSQL; checked via CloudWatch |
DatabaseConnections | 70% of instance max | 85% | Per-instance limit, check RDS parameter group |
FreeStorageSpace | < 20 GB or < 20% | < 5 GB | Aurora storage auto-scales but billing changes |
AuroraVolumeBytesLeftTotal | < 10 TB | < 1 TB | Aurora 128 TB storage ceiling |
WriteIOPS | 2× 7-day P95 | 5× 7-day P95 | Sudden IOPS spike — check for bulk loads |
EngineUptime | — | Unexpected reset | Unexpected restart — check for OOM or crash |
Rollback Plan
If a threshold change causes alert fatigue or misses a real incident:
- Revert to the previous threshold immediately and document the direction of failure (too sensitive vs. too permissive).
- Collect a 7-day baseline at the previous threshold before making another change.
- For critical alerts, always test in staging with a simulated failure scenario before applying to production.
- Keep a changelog of threshold changes with the justification and the measurement that motivated each change.
Automation Opportunity
Alert routing automation that reduces toil:
-
Batch job suppression: automatically suppress replica lag alerts during known ETL windows (e.g., 01:00–04:00 UTC) and backup windows. Log the suppression, do not silently drop.
-
Alert correlation: when connection exhaustion and slow query alerts fire within 5 minutes of each other, group them into a single incident with both signals attached. The root cause is almost always the same event.
-
Baseline drift detection: weekly job that checks whether current metric values have permanently shifted from the thresholds set 30 days ago. If p95 is consistently higher than the warning threshold, the baseline has shifted — either the system is degrading or the workload grew.
Leadership Summary
Database alert reliability is a trust problem as much as a technical one. Teams stop responding to alerts that have false-positive rates above 20%. The two-level architecture (warning = investigate, critical = page) with calibrated per-metric thresholds keeps signal quality high enough that critical alerts are taken seriously. The measurement-first approach — setting thresholds from 7-day baselines rather than intuition — produces thresholds that reflect actual system behavior, not guesses.
Where It Breaks
| Failure mode | Trigger | Fix |
|---|---|---|
| Threshold set without baseline | Alert fires on normal workload variation | Measure 7-day baseline before setting any threshold |
| Global slow query threshold | Legitimate analytics queries fire alert constantly | Per-query-class thresholds or separate analytics monitoring policy |
| Alert on every autovacuum | autovacuum is working correctly but noisy | Alert on dead tuple ratio, not autovacuum event frequency |
| Missing maintenance window suppression | Backup and ETL jobs generate false positives every night | Add time-of-day or scheduled suppressions with logging |
| No test for false negatives | Team knows when alerts fire too much, but not when they miss | Simulate failure scenarios in staging quarterly to verify alert coverage |
What to Do Next
- Problem: Your database alerts either fire too often (ignored) or too late (users complain first).
- Solution: Measure 7-day baselines for the five metric groups above, then set two-level thresholds (warning, critical) calibrated to those baselines.
- Proof: Replay the last three database incidents against the proposed thresholds and verify they would have alerted at the warning level before user impact.
- Action: This week, pull 7 days of replica lag, connection utilization, and slow query data from your monitoring tool and set the two-level thresholds using the reference values above as a starting point.