The End of Single-Signal Alerting: Correlating Metrics, Logs, Traces, Deployments, and Cost
If you wake an engineer up at 3 AM because a single metric crossed an arbitrary line on a graph, you are training them to ignore your monitoring system.
Situation
For years, the standard operating procedure for database monitoring was to define a static threshold for every hardware metric. If CPU utilization crossed 85% for five minutes, page the on-call DBA. If disk space dropped below 20%, page the on-call DBA. If memory utilization hit 90%, page the on-call DBA.
This approach creates an endless stream of noise. An 85% CPU utilization on a database during a nightly batch processing window is not an incident; it is a highly efficient use of provisioned resources. Conversely, a database running at 30% CPU might be completely broken if a connection pool limit is blocking all incoming traffic. A modern observability architecture must abandon single-signal alerting in favor of multi-signal correlation.
Symptoms
A platform relying on single-signal alerts is easy to identify by its operational dysfunction:
- The Boy Who Cried Wolf: The on-call engineer receives 50 pages a week, acknowledges them from their phone without opening a laptop, and goes back to sleep because “it always does that at midnight.”
- The Missing Context: A page fires for “High Database Latency,” but the alert contains no information about which service is experiencing the latency, forcing the engineer to start the investigation from scratch.
- The Silent Outage: The application is completely down because a bad deployment pushed a malformed SQL query. The database CPU is at 2%, so no database alerts fire, leaving the DBA team unaware of the incident until an escalation occurs.
- The Cost Surprise: A misconfigured ORM starts executing a Cartesian join, driving massive I/O throughput. No availability alert fires because the database absorbs the load, but the monthly AWS bill spikes by $10,000.
First Five Checks
To move to correlated alerting, you must evaluate your existing monitors against these five criteria:
- Check for User Impact: Does the alert measure a symptom experienced by a user? (e.g., API latency > 500ms) If it only measures an internal resource (e.g., CPU > 85%), it should be a warning, not a page.
- Correlate with Traffic Volume: Is the metric anomaly correlated with a drop in request volume? If database latency is high but request volume has dropped to zero, the load balancer is likely the true root cause, not the database.
- Check for Recent Deployments: Can the alerting engine overlay deployment events on the metric graph? If a metric spikes within 5 minutes of a code rollout, the alert payload must explicitly state: “Possible cause: Deployment v1.2.3.”
- Correlate with Error Logs: Are high-severity logs increasing concurrently with the metric anomaly? An I/O spike accompanied by OOMKilled logs tells a completely different story than an I/O spike with zero error logs.
- Evaluate Cost Implications: Is the anomalous behavior driving variable costs? If a sudden change in query shape causes read units in DynamoDB to spike, the alert must correlate the operational metric with the financial impact.
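The deployment check above can be sketched in a few lines. This is a minimal illustration, assuming deployment events are available as (version, finish time) records; the names Deployment and possible_cause are hypothetical and do not come from any particular alerting product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    version: str
    finished_at: datetime


def possible_cause(
    anomaly_start: datetime,
    deployments: list[Deployment],
    window: timedelta = timedelta(minutes=5),  # the 5-minute window from the check above
) -> Optional[str]:
    """Return a 'Possible cause' line if a rollout finished shortly before the anomaly."""
    candidates = [
        d for d in deployments
        if timedelta(0) <= anomaly_start - d.finished_at <= window
    ]
    if not candidates:
        return None
    # If several rollouts fall inside the window, blame the most recent one.
    latest = max(candidates, key=lambda d: d.finished_at)
    return f"Possible cause: Deployment {latest.version}"
```

The returned string is meant to be appended to the alert payload; a result of None means no rollout finished inside the window, which is itself useful context for the responder.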
Decision Tree
When designing a new alert, use this logic to ensure it relies on correlated signals rather than isolated noise:
flowchart TD
A[Design New Alert] --> B{Does this metric measure User Impact?}
B -->|No| C{Is resource exhaustion imminent within 2 hours?}
C -->|No| D[Log as Warning / Triage Next Day]
C -->|Yes| E[Require Secondary Correlation]
B -->|Yes| E
E --> F{Is there a concurrent anomaly?}
F -->|Log Errors| G[Page: High Latency + App Errors]
F -->|Deploy Event| H[Page: High Latency + Recent Deploy]
F -->|Cost Spike| I[Page: High Latency + Burning Budget]
F -->|No| J[Page: Degradation, Unknown Cause]
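The same routing can be expressed as a small function, which makes the logic easy to unit test. This is a sketch under assumptions: the Signals fields and the route function are hypothetical names, and the order in which concurrent anomalies are checked (logs, then deploys, then cost) is a choice made here, since the flowchart does not rank them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    measures_user_impact: bool
    exhaustion_within_2h: bool = False
    log_errors_rising: bool = False
    recent_deploy: bool = False
    cost_spike: bool = False


def route(s: Signals) -> str:
    """Map a set of correlated signals to the outcome labels used in the flowchart."""
    # No user impact and no imminent exhaustion: never page.
    if not s.measures_user_impact and not s.exhaustion_within_2h:
        return "Log as Warning / Triage Next Day"
    # Assumed priority when several anomalies are concurrent.
    if s.log_errors_rising:
        return "Page: High Latency + App Errors"
    if s.recent_deploy:
        return "Page: High Latency + Recent Deploy"
    if s.cost_spike:
        return "Page: High Latency + Burning Budget"
    return "Page: Degradation, Unknown Cause"
```

For example, route(Signals(measures_user_impact=True, recent_deploy=True)) returns "Page: High Latency + Recent Deploy".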
Remediation Options
- Implement Service Level Objectives (SLOs) (High Impact, High Effort): Replace infrastructure alerts with error budget burn-rate alerts. You page the engineer only when errors or latency consume the error budget fast enough to threaten the SLO agreed with the business.
  - Tradeoff: Requires a cultural shift and significant engineering effort to define, measure, and agree upon SLOs across product and engineering teams.
- Build Composite Monitors (Medium Impact, Medium Effort): Configure your observability platform to trigger an alert only when Metric A AND Metric B are both true (e.g., CPU > 85% AND API 5xx Errors > 5%).
  - Tradeoff: Composite logic can become brittle and difficult to maintain as application architectures evolve.
- Mute Non-Actionable Alerts (Fast, High Reward): Audit the last 30 days of pages. Any alert that was consistently acknowledged and resolved without action must be downgraded to a Slack notification or deleted entirely.
  - Tradeoff: The team must overcome the fear of “what if we miss something,” leaning into the philosophy that alert noise is a bigger risk than a dropped signal.
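To make the burn-rate option concrete: burn rate is the observed error rate divided by the error budget (1 minus the SLO target), so a burn rate of 1 spends the budget exactly over the SLO period. The sketch below assumes request counts are available per window. The function names are hypothetical, and the 14.4 default is a commonly cited fast-burn threshold for a one-hour window against a 30-day SLO, used here as an assumption rather than a recommendation. Requiring both a long and a short window to exceed the threshold is itself a form of multi-signal correlation.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - SLO target)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 0.0  # no traffic in the window, so no budget was spent
    return (errors / total) / (1.0 - slo_target)


def should_page_on_burn(
    long_window: tuple[int, int],   # (errors, total) over e.g. the last hour
    short_window: tuple[int, int],  # (errors, total) over e.g. the last 5 minutes
    slo_target: float,
    threshold: float = 14.4,        # assumed fast-burn threshold, tune per SLO
) -> bool:
    """Page only if the budget is burning fast in both windows."""
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )
```

With a 99.9% SLO, 200 errors out of 10,000 requests is a 2% error rate against a 0.1% budget, a burn rate of roughly 20. The short window keeps the alert from paging after the problem has already stopped.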
Rollback Plan
If you transition to correlated alerting and discover a critical failure mode was missed because the secondary correlation (e.g., the log stream) was delayed or broken, you must temporarily reinstate the broad single-signal alerts. Do not leave the system blind while you fix the correlation engine.
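One way to reduce the need for a manual rollback is to make the composite rule fail open: if the secondary signal is stale, fall back to the single-signal behavior automatically. The following is a minimal sketch, assuming the log pipeline exposes a timestamp for the last record it received; page_with_fallback and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta


def page_with_fallback(
    metric_breached: bool,
    errors_elevated: bool,
    log_last_seen: datetime,
    now: datetime,
    max_staleness: timedelta = timedelta(minutes=5),  # assumed freshness limit
) -> bool:
    """Composite rule that degrades to single-signal paging when logs are stale."""
    if not metric_breached:
        return False
    log_signal_fresh = (now - log_last_seen) <= max_staleness
    if not log_signal_fresh:
        # Fail open: a missing secondary signal must not suppress the page.
        return True
    return errors_elevated
```

This trades some noise during log pipeline outages for the guarantee that a broken correlation source never leaves the system blind.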
Automation Opportunity
Automate the correlation payload. When an alert fires, trigger a Lambda function or webhook that queries the APM traces, pulls the last 10 minutes of error logs, fetches the most recent deployment commit hash, and appends all this context to the PagerDuty ticket before it wakes the engineer. The engineer should open the ticket and immediately see a correlated narrative, not just a bare metric.
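A sketch of that enrichment step is shown below. It deliberately avoids calling any real vendor API: each context source is passed in as a plain function, so Alert, Fetcher, and enrich are hypothetical names, and the actual trace, log, and deployment lookups are assumed to exist elsewhere.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Alert:
    title: str
    service: str
    context: dict[str, object] = field(default_factory=dict)


# A fetcher takes a service name and returns whatever context it can find.
Fetcher = Callable[[str], object]


def enrich(alert: Alert, fetchers: dict[str, Fetcher]) -> Alert:
    """Attach context from each source; one failing source must not block the page."""
    for name, fetch in fetchers.items():
        try:
            alert.context[name] = fetch(alert.service)
        except Exception as exc:
            # Record the gap so the responder knows this context is missing.
            alert.context[name] = f"unavailable: {exc}"
    return alert
```

In practice the fetchers dictionary might hold entries such as "slow_traces", "error_logs", and "last_deploy". Catching failures per source matters because the enrichment runs on the paging path, where a timeout in one backend should degrade the context instead of delaying the alert.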
Leadership Summary
- Alerts Must Require Action: If an alert fires and the correct response is “wait and see,” the alert is fundamentally broken.
- Context is King: The difference between a 5-minute MTTR and a 2-hour MTTR is often just the presence of deployment and log context directly inside the alert payload.
- Protect the On-Call Engineer: Alert fatigue causes burnout and missed critical failures. Ruthlessly defend your team’s attention by demanding multi-signal correlation for any high-urgency page.
What to Do Next
- Problem: Single-signal resource alerts (CPU > 85%, memory > 90%) train engineers to ignore the pager because the threshold has no relationship to user impact or required action. As a result, the one alert that matters gets the same treatment as the 49 that didn’t need action.
- Solution: Require every page-worthy alert to pass an actionability review before deployment: what is the exact runbook step the engineer executes when this fires? If no runbook exists, the alert should not page.
- Proof: Convert your highest-volume infrastructure alert to a composite requiring a concurrent spike in application error rate before paging — then measure the weekly alert volume reduction. If volume doesn’t drop by at least 30%, the alert was already correlated with real incidents and the baseline was accurate.
- Action: Audit the last 30 days of pager history this week. Delete any alert consistently acknowledged and auto-resolved without action. Every surviving alert must have a runbook link in the payload — no runbook, no page.
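The audit in the last step can be partly scripted. The sketch below assumes the pager history has been exported as one record per page with a flag for whether the responder took any action; Page, non_actionable, and both thresholds are hypothetical and should be adjusted to the team's own definition of "consistently."

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    alert_name: str
    action_taken: bool


def non_actionable(
    pages: list[Page],
    min_pages: int = 5,             # ignore alerts too rare to judge
    max_action_ratio: float = 0.1,  # at most 10% of pages led to action
) -> list[str]:
    """Return alert names that paged often but almost never required action."""
    total = Counter(p.alert_name for p in pages)
    acted = Counter(p.alert_name for p in pages if p.action_taken)
    return sorted(
        name for name, n in total.items()
        if n >= min_pages and acted[name] / n <= max_action_ratio
    )
```

The output is a candidate list for downgrade or deletion, not a verdict: a human should still confirm that each listed alert has no runbook step worth keeping.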