Most database alert fatigue comes from thresholds set to catch anything unusual rather than thresholds calibrated to actual user impact. An alert that fires on every autovacuum run, every checkpoint, and every 5-second replica lag spike will be silenced by engineers within a week — and then the real incidents will go unnoticed.

Situation

Database teams accumulate alerts in one of two ways: copy default thresholds from the monitoring tool’s out-of-box configuration, or set thresholds after an incident when the previous absence of an alert was painful. Both approaches produce the wrong result.

Default thresholds are calibrated for visibility, not signal quality. They generate enough noise that teams learn to ignore them. Incident-driven thresholds overfit to a specific failure pattern and miss adjacent ones.

The right design is a two-level alert architecture: a warning level that gives the team early signal and time to investigate, and a critical level that triggers paging because user impact is already occurring or imminent.

Symptoms

Symptom in the alert systemWhat it usually means
Alert fired, no incident foundThreshold is at wrong level or condition is transient and self-resolving
Alert fired after users already complainedThreshold is too high or measurement resolution is too low
Same alert fires daily at the same timeNormal batch job or backup window — suppress or add time-based exclusion
Alert never fires in productionEither system is very healthy, or threshold is too permissive
Multiple alerts fire at once for the same root causeMissing alert correlation — downstream symptoms of a single root cause

First Five Checks

Before setting any threshold, measure the baseline over 7 days on the production workload.

1. What is the normal replica lag distribution?

Collect replay_lag from pg_stat_replication (PostgreSQL) or Seconds_Behind_Master (MySQL) every 60 seconds for 7 days. Identify:

  • Median lag during business hours
  • 95th percentile lag during peak write periods
  • Maximum lag during known batch jobs or backups

Set the warning threshold at 2× the 95th percentile peak. Set the critical threshold at the point where read replicas return data more than one commit cycle stale for your application’s consistency requirements — typically 60–120 seconds for OLTP, 5–15 minutes for analytics.

2. What is the normal connection utilization pattern?

-- PostgreSQL: connections used vs max
SELECT count(*) AS active,
       (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max_conn,
       ROUND(count(*) * 100.0 /
             (SELECT setting::int FROM pg_settings WHERE name = 'max_connections'), 1) AS pct_used
FROM pg_stat_activity;

Measure this every minute over 7 days. Alert at 70% (warning — time to investigate pool settings) and 85% (critical — application will soon see connection errors).

3. What does checkpoint behavior look like during normal operations?

From pg_stat_bgwriter, collect checkpoints_req over time. Zero is ideal — all checkpoints should be checkpoints_timed. Any non-zero checkpoints_req over a 5-minute period means write pressure is forcing early checkpoints. Alert when checkpoints_req > 0 for more than 3 consecutive minutes.

4. What is the slow query baseline?

Enable pg_stat_statements and measure the 95th percentile query duration for your top 20 query types over 7 days. Use this to set application-specific slow query thresholds — not a global “any query over 1 second” rule, which fires on legitimate analytical queries.

5. What does disk growth look like?

Measure database disk usage daily for 30 days and compute the trend. Alert when the projected exhaustion date (at the current growth rate) falls within 14 days. This is a warning. A critical alert triggers when the projected exhaustion falls within 3 days or when a sudden disk spike exceeds the 30-day average growth by 5×.

Decision Tree

flowchart TD
    A[Alert fires] --> B{User impact?}
    B -->|Users already reporting issues| C[Critical — escalate to on-call]
    B -->|No user reports| D{Trending toward impact?}
    D -->|Yes — within SLO window| E[Warning — investigate now]
    D -->|No — transient spike| F{Is this a known pattern?}
    F -->|Yes — batch job, backup, maintenance| G[Suppress for this window — add schedule exclusion]
    F -->|No — unexpected| H[Investigate root cause — check pg_stat_activity and slow query log]
    H --> I{Root cause identified?}
    I -->|Yes| J[Fix or tune threshold — document the baseline]
    I -->|No| K[Escalate with evidence package — query plans, metrics window, server log]

Alert Thresholds Reference

PostgreSQL

MetricWarningCriticalNotes
Replica lag60s300sUse replay_lag; adjust for batch job windows
Connection utilization70% of max_connections85%Count only non-idle sessions for more accurate signal
checkpoints_req> 0 for 3 min> 0 for 10 minAny forced checkpoint means write pressure
Dead tuple ratio20% on tables > 100k rows40%Per-table alert, not global
Cache hit ratio< 97%< 90%Monitor pg_statio_user_tables hits vs reads
Table bloat (relation size growth)2× expected3× expectedCompare against 30-day baseline
Long-running query> 60s> 300sOLTP threshold; analytical systems need separate policy
Idle-in-transaction session> 5 min> 15 minPer-session duration, not aggregate count
pg_stat_replication slot lag100 MB1 GBUnused replication slots block WAL cleanup

MySQL / Aurora MySQL

MetricWarningCriticalNotes
Seconds_Behind_Master30s120sUse Aurora replica lag metric in CloudWatch for Aurora
Threads_connected70% of max_connections85%Threads_running spike is the lead indicator
Innodb_buffer_pool_wait_free> 0 per 5 min> 100 per 5 minBuffer pool pages not available — memory pressure
Innodb_log_waits> 0 per 5 min> 10 per 5 minRedo log full — write throughput exceeded
Slow query rate2× 7-day average5× 7-day averageRate, not absolute count
Open_tables80% of table_open_cache95%Too-small cache causes repeated table opens
Lock wait timeout> 5 per minute> 20 per minuteHigh contention — check for hot rows or large transactions

Aurora PostgreSQL / Aurora MySQL (CloudWatch-specific)

CloudWatch metricWarningCriticalNotes
ReplicaLag30s120sDistinct from standard PostgreSQL; checked via CloudWatch
DatabaseConnections70% of instance max85%Per-instance limit, check RDS parameter group
FreeStorageSpace< 20 GB or < 20%< 5 GBAurora storage auto-scales but billing changes
AuroraVolumeBytesLeftTotal< 10 TB< 1 TBAurora 128 TB storage ceiling
WriteIOPS2× 7-day P955× 7-day P95Sudden IOPS spike — check for bulk loads
EngineUptimeUnexpected resetUnexpected restart — check for OOM or crash

Rollback Plan

If a threshold change causes alert fatigue or misses a real incident:

  1. Revert to the previous threshold immediately and document the direction of failure (too sensitive vs. too permissive).
  2. Collect a 7-day baseline at the previous threshold before making another change.
  3. For critical alerts, always test in staging with a simulated failure scenario before applying to production.
  4. Keep a changelog of threshold changes with the justification and the measurement that motivated each change.

Automation Opportunity

Alert routing automation that reduces toil:

  1. Batch job suppression: automatically suppress replica lag alerts during known ETL windows (e.g., 01:00–04:00 UTC) and backup windows. Log the suppression, do not silently drop.

  2. Alert correlation: when connection exhaustion and slow query alerts fire within 5 minutes of each other, group them into a single incident with both signals attached. The root cause is almost always the same event.

  3. Baseline drift detection: weekly job that checks whether current metric values have permanently shifted from the thresholds set 30 days ago. If p95 is consistently higher than the warning threshold, the baseline has shifted — either the system is degrading or the workload grew.

Leadership Summary

Database alert reliability is a trust problem as much as a technical one. Teams stop responding to alerts that have false-positive rates above 20%. The two-level architecture (warning = investigate, critical = page) with calibrated per-metric thresholds keeps signal quality high enough that critical alerts are taken seriously. The measurement-first approach — setting thresholds from 7-day baselines rather than intuition — produces thresholds that reflect actual system behavior, not guesses.

Where It Breaks

Failure modeTriggerFix
Threshold set without baselineAlert fires on normal workload variationMeasure 7-day baseline before setting any threshold
Global slow query thresholdLegitimate analytics queries fire alert constantlyPer-query-class thresholds or separate analytics monitoring policy
Alert on every autovacuumautovacuum is working correctly but noisyAlert on dead tuple ratio, not autovacuum event frequency
Missing maintenance window suppressionBackup and ETL jobs generate false positives every nightAdd time-of-day or scheduled suppressions with logging
No test for false negativesTeam knows when alerts fire too much, but not when they missSimulate failure scenarios in staging quarterly to verify alert coverage

What to Do Next

  • Problem: Your database alerts either fire too often (ignored) or too late (users complain first).
  • Solution: Measure 7-day baselines for the five metric groups above, then set two-level thresholds (warning, critical) calibrated to those baselines.
  • Proof: Replay the last three database incidents against the proposed thresholds and verify they would have alerted at the warning level before user impact.
  • Action: This week, pull 7 days of replica lag, connection utilization, and slow query data from your monitoring tool and set the two-level thresholds using the reference values above as a starting point.