If you view AI in observability as just a natural-language search bar, you are missing the shift from passive tools to autonomous on-call teammates.

Situation

Historically, observability platforms were strictly passive. They collected telemetry, triggered an alert based on a static threshold, and waited for a human to interpret the data. If a database CPU spiked, a DBA was paged. The DBA then had to open Datadog, manually correlate the CPU spike with database query metrics, check the APM traces to identify the calling service, and look at the deployment pipeline to see if code had recently changed.

The introduction of agents like Datadog Bits AI SRE fundamentally changes this contract. Bits AI is not just a search tool; it acts as an autonomous on-call teammate. When a page fires, Bits AI begins investigating in the background. By the time the human engineer acknowledges the page in Slack, the agent has already correlated the telemetry, tested multiple hypotheses, and posted a summary of its findings and suggested remediations.

Symptoms

Organizations that have not adopted autonomous incident investigation usually suffer from specific operational friction:

  • The Slack Scramble: The #incident channel is chaotic, filled with engineers posting screenshots of different graphs and asking, “Did anyone deploy?”
  • The Context Gap: A backend engineer gets paged for high latency but has no idea how to interpret the RDS metrics dashboard, leading to an unnecessary escalation to the DBA team.
  • The Cold Start: Every incident investigation starts from zero. The first 10 minutes are spent executing the exact same mental runbook (check CPU, check logs, check deployments) every single time.
  • The Post-Mortem Amnesia: After the incident, the exact sequence of graphs and logs used to diagnose the issue is lost because it only existed in an engineer’s browser history.

First Five Checks

When working with an AI SRE teammate, the DBA’s “first five checks” shift from executing queries to reviewing the agent’s autonomous workflow:

  1. Review the Incident Summary in Slack/Teams: Does the AI summary accurately describe the failure? Look for the plain-language explanation (e.g., “PostgreSQL CPU spiked to 99% due to an increase in sequential scans from the checkout service.”).

  2. Check the Correlation Engine Output: Bits AI surfaces related events. Verify whether it correctly linked the database metric spike to an infrastructure change, a feature flag toggle, or a code deployment.

  3. Validate the Hypothesis: The agent will present one or more root-cause hypotheses. As the subject matter expert, you must evaluate whether the agent correctly interpreted the database’s internal state (for example, lock contention, query plans, or replication lag).

  4. Review Suggested Actions: The AI will suggest remediation steps (e.g., “Roll back deployment X” or “Kill process ID 1234”). Check these for safety and correctness before executing them.

  5. Prompt for Deep Dives: If the summary is insufficient, use natural language to dig deeper: “Bits, show me the exact SQL query causing the sequential scans and the application logs from the service executing it.”
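
The correlation in check 2 can be sanity-checked by hand with a simple time-window filter. The sketch below is an illustration only, not how Bits AI works internally: the `ChangeEvent` shape, the `changes_before_spike` helper, and the 30-minute lookback are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical record for a deployment, feature-flag toggle, or infra change."""
    kind: str            # e.g. "deployment", "feature_flag", "infrastructure"
    service: str
    occurred_at: datetime


def changes_before_spike(events, spike_at, lookback=timedelta(minutes=30)):
    """Return change events inside the lookback window before the spike, newest first."""
    window_start = spike_at - lookback
    candidates = [e for e in events if window_start <= e.occurred_at <= spike_at]
    return sorted(candidates, key=lambda e: e.occurred_at, reverse=True)


# Example data (invented for illustration).
spike = datetime(2024, 1, 1, 12, 0)
events = [
    ChangeEvent("deployment", "checkout", datetime(2024, 1, 1, 11, 50)),
    ChangeEvent("feature_flag", "search", datetime(2024, 1, 1, 9, 0)),
    ChangeEvent("infrastructure", "rds", datetime(2024, 1, 1, 12, 5)),
]
suspects = changes_before_spike(events, spike)
```

If the agent links the spike to a change that falls outside any plausible window, or ignores one that falls inside it, that is a reason to question the hypothesis before acting on it.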

Decision Tree

The integration of an AI SRE teammate creates a new triage workflow.

flowchart TD
    A[Alert Triggers] --> B[Bits AI SRE Autonomous Investigation]
    B --> C[AI Posts Summary & Hypothesis to Slack]
    C --> D[Human Engineer Acknowledges Alert]
    D --> E{Does Human Trust Hypothesis?}
    E -->|Yes| F[Execute AI-Suggested Remediation]
    F --> F1{Did it resolve?}
    F1 -->|Yes| F2[AI Auto-Generates Post-Mortem]
    F1 -->|No| G
    
    E -->|No| G[Prompt AI for Raw Data / Traces]
    G --> H[Human Diagnoses Manually]
    H --> I[Human Executes Remediation]
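
The branches of the flowchart can also be written as a small function, which is useful if you want to encode the workflow in a runbook or a chat bot. This is a minimal sketch; the `NextStep` enum and the `next_step` function are hypothetical names, not part of any Datadog API.

```python
from enum import Enum


class NextStep(Enum):
    EXECUTE_AI_REMEDIATION = "execute AI-suggested remediation"
    AI_DRAFTS_POST_MORTEM = "AI auto-generates post-mortem"
    MANUAL_DIAGNOSIS = "prompt AI for raw data, then diagnose manually"


def next_step(trusts_hypothesis, remediation_resolved=None):
    """Return the next step in the triage flow.

    remediation_resolved is None while the AI-suggested remediation
    has not been attempted yet.
    """
    if not trusts_hypothesis:
        return NextStep.MANUAL_DIAGNOSIS
    if remediation_resolved is None:
        return NextStep.EXECUTE_AI_REMEDIATION
    if remediation_resolved:
        return NextStep.AI_DRAFTS_POST_MORTEM
    # Trusted hypothesis, but the fix did not work: fall back to manual diagnosis.
    return NextStep.MANUAL_DIAGNOSIS
```

Note that both "no" branches converge on manual diagnosis, which mirrors the two arrows into node G.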

Remediation Options

  1. One-Click AI Remediation (Fast, High Risk): If the AI agent provides a remediation button (e.g., triggering a runbook to restart a pod or kill a query), the engineer can execute it directly from chat.
    • Tradeoff: Removing friction makes it easy to execute dangerous actions without fully understanding the blast radius.

  2. Conversational Mitigation (Medium Speed, Guided Control): The engineer asks the AI to generate the specific CLI command or SQL query to fix the issue, reviews it, and executes it manually.
    • Tradeoff: Slightly slower, but forces the engineer to validate the exact syntax before execution.

  3. Manual Override (Slow, Complete Control): The engineer ignores the AI’s suggestions and uses standard dashboards and terminals to mitigate the issue.
    • Tradeoff: Misses the speed benefits of the AI, but necessary when the agent hallucinates or misunderstands a novel failure mode.
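
One way to keep the one-click option from becoming the default for risky actions is an allowlist. The sketch below assumes a team-maintained set of pre-approved runbook IDs; the IDs, the `Mode` enum, and `choose_mode` are invented for illustration and do not correspond to any real product setting.

```python
from enum import Enum


class Mode(Enum):
    ONE_CLICK = "one-click AI remediation"
    CONVERSATIONAL = "conversational mitigation"
    MANUAL = "manual override"


# Hypothetical runbook IDs the team has pre-approved for one-click execution.
PRE_APPROVED_RUNBOOKS = frozenset({"rollback_last_deployment", "restart_stateless_pod"})


def choose_mode(runbook_id, hypothesis_trusted):
    """Pick a remediation mode for an AI-suggested runbook."""
    if not hypothesis_trusted:
        # The agent may have misread the failure: ignore its suggestions.
        return Mode.MANUAL
    if runbook_id in PRE_APPROVED_RUNBOOKS:
        return Mode.ONE_CLICK
    # Anything not explicitly approved gets a human syntax review first.
    return Mode.CONVERSATIONAL
```

An allowlist is used here rather than a denylist so that a new, unreviewed action type lands in the slower, reviewed path by default.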

Rollback Plan

If an AI-suggested action exacerbates the issue, you must treat the AI as a compromised tool. Immediately revoke its ability to execute runbooks (if auto-remediation was enabled), revert the specific change manually, and switch entirely to manual diagnostic dashboards. Do not ask the AI how to fix the problem it just caused.

Automation Opportunity

The greatest automation opportunity is the post-mortem. Bits AI observes the entire incident timeline: which graphs were viewed, which logs were queried, and which commands were run. It can automatically generate the first draft of the incident timeline and post-mortem document, saving the DBA hours of toil and preserving the diagnostic trail that would otherwise be lost in an engineer’s browser history. The draft still needs human review before it becomes the organizational record.
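
To make the idea of a drafted timeline concrete, here is a minimal sketch that turns recorded incident activity into a chronological plain-text draft. The `(timestamp, actor, action)` tuple shape and the `draft_timeline` function are assumptions for the example; the actual format an agent records or emits will differ.

```python
from datetime import datetime


def draft_timeline(entries):
    """Render (timestamp, actor, action) tuples as a chronological text timeline."""
    lines = []
    for ts, actor, action in sorted(entries, key=lambda e: e[0]):
        lines.append(f"{ts:%H:%M:%S} [{actor}] {action}")
    return "\n".join(lines)


# Example activity, deliberately out of order (invented for illustration).
activity = [
    (datetime(2024, 1, 1, 12, 4), "engineer", "acknowledged page"),
    (datetime(2024, 1, 1, 12, 0), "alert", "PostgreSQL CPU above threshold"),
    (datetime(2024, 1, 1, 12, 2), "agent", "posted summary and hypothesis"),
]
timeline = draft_timeline(activity)
```

The output is only a first draft: a reviewer still has to confirm that the recorded sequence reflects what actually drove the diagnosis.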

Leadership Summary

  • Agents Reduce MTTA (Mean Time To Acknowledge): A correlated summary posted directly in the chat window lets engineers acknowledge an incident and begin acting on it immediately.
  • Democratizing Database Diagnostics: An AI SRE allows backend engineers to triage basic database issues without instantly escalating to a senior DBA, lowering the on-call burden.
  • The ChatOps Evolution: ChatOps is no longer about typing /deploy in Slack. It is about having a conversational interface with your entire observability stack.

What to Do Next

  • Problem: When AI-assisted triage is adopted as a natural-language search bar, it misses its core value: autonomous hypothesis generation that begins before the human acknowledges the page. Without this, you’ve added a chat interface but not reduced time-to-diagnosis.
  • Solution: Configure Bits AI SRE (or equivalent) to start autonomous investigation the moment a database alert triggers, route the correlated summary to the incident Slack channel before the first human response, and mandate that all deployments and feature flag changes stream to Datadog as tagged events for correlation.
  • Proof: During the next incident review, measure whether the AI hypothesis matched the actual root cause and whether it arrived before an engineer would have independently reached the same conclusion. Accuracy and lead time together determine whether this tool is reducing MTTR.
  • Action: Configure your three highest-frequency database alerts to automatically trigger a Bits AI investigation chain this sprint, and require the AI-generated post-mortem draft to be reviewed before the next retrospective.
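
The two measurements named under Proof, accuracy and lead time, can be tracked with a few lines of code once incident reviews record them. This is a sketch under stated assumptions: the `IncidentReview` fields and the `score` function are hypothetical, and the human diagnosis time is an estimate made during the review, not something a tool reports.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class IncidentReview:
    """Hypothetical per-incident record filled in during the review."""
    hypothesis_matched_root_cause: bool
    ai_summary_minutes: float        # alert -> AI summary posted
    human_diagnosis_minutes: float   # alert -> estimated independent human diagnosis


def score(reviews):
    """Return (accuracy, median lead time in minutes over correct hypotheses)."""
    if not reviews:
        raise ValueError("at least one incident review is required")
    accuracy = sum(r.hypothesis_matched_root_cause for r in reviews) / len(reviews)
    # Lead time only counts when the hypothesis was right; a fast wrong answer saves nothing.
    lead_times = [
        r.human_diagnosis_minutes - r.ai_summary_minutes
        for r in reviews
        if r.hypothesis_matched_root_cause
    ]
    return accuracy, (median(lead_times) if lead_times else None)


# Example reviews (invented numbers).
reviews = [
    IncidentReview(True, 2, 12),
    IncidentReview(True, 3, 9),
    IncidentReview(False, 2, 20),
]
accuracy, median_lead = score(reviews)
```

A high accuracy with a near-zero or negative lead time suggests the agent is confirming what engineers already knew; a positive lead time with low accuracy suggests it is fast but not yet trustworthy.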