Agentic SRE Architecture: Skills, Agents, MCP Servers, and Human Approval Loops

If you wire a large language model directly to your production database with root credentials and a prompt that says “fix any issues,” you are begging for a resume-generating event.

Situation

We have traced the evolution of database observability over three distinct eras. In 2024, the industry focused on standardizing the dashboard foundation—tracking saturation, locks, and lag through deterministic systems like Datadog, Prometheus, and CloudWatch. In 2025, the focus shifted to AI-assisted operations, using generative AI to compress the noise of 500 alerts into a single, correlated, natural-language root-cause hypothesis.

Now, in 2026, we have reached the era of Agentic Site Reliability Engineering (SRE). Instead of a human engineer reading an AI-generated summary and clicking buttons in a runbook, networks of specialized AI agents observe the telemetry, diagnose the failure, debate the tradeoff, formulate a remediation plan, and execute it.

However, building an Agentic SRE architecture is not about giving a single omnipotent LLM access to your infrastructure. It requires a distributed-systems approach: deploy narrowly scoped, read-only specialist agents that communicate over standard protocols (like MCP), and route every proposed change through a rigid, deterministic human-in-the-loop approval gate.

The Problem

When organizations attempt to implement autonomous operations, they typically make three architectural mistakes:

The God Agent: They deploy a single agent with a massive context window and give it access to every tool—from querying the database to restarting Kubernetes nodes. When an incident occurs, the agent gets confused by the sheer volume of available actions, hallucinates arguments, and executes the wrong command.
The Implicit Write Access: They grant the agent a single database role that has both SELECT and DROP privileges. During a frantic triage session, the agent accidentally executes a destructive command while trying to clear a temporary table.
The Unverifiable Execution: They allow the agent to execute remediation plans silently. When the system recovers (or crashes), the human engineering team has no audit trail of what the agent actually did, making post-mortems impossible.
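
One way to address the third mistake is an append-only audit log in which each entry commits to the one before it, so that later edits are detectable. The sketch below is illustrative only: hash chaining is an assumption added here, not something the architecture prescribes, and `append_entry` and `verify_chain` are hypothetical names.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(actor: str, action: str, prev_hash: str) -> str:
    # Hash a canonical JSON encoding of the entry body.
    body = {"actor": actor, "action": action, "prev_hash": prev_hash}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log: List[Dict[str, str]], actor: str, action: str) -> Dict[str, str]:
    """Append an audit entry that commits to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    entry = {
        "actor": actor,
        "action": action,
        "prev_hash": prev_hash,
        "hash": _digest(actor, action, prev_hash),
    }
    log.append(entry)
    return entry


def verify_chain(log: List[Dict[str, str]]) -> bool:
    """Return True only if no entry was altered, removed, or reordered."""
    prev_hash = GENESIS_HASH
    for entry in log:
        expected = _digest(entry["actor"], entry["action"], entry["prev_hash"])
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True
```

With a log like this, a post-mortem can replay exactly which component proposed, approved, and executed each step.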

Agentic SRE Reference Architecture

A production-grade Agentic SRE architecture breaks the incident lifecycle into isolated, highly constrained stages.

The Detector Agent: This is not an LLM. It is a deterministic alerting engine (e.g., Prometheus Alertmanager or CloudWatch Alarms) that monitors p99 latency and error rates. When an SLO is violated, it triggers the orchestration pipeline.
The Diagnosis Agent (Read-Only): This agent has a single purpose: data gathering. It connects to the database via an MCP Server using a strict READ_ONLY role. It executes queries against pg_stat_activity or Performance Insights, pulls the last 10 minutes of logs, and formulates a hypothesis.
The Remediation Planner Agent: This agent takes the hypothesis from the Diagnosis Agent and cross-references it with the company’s approved runbook repository. It generates a step-by-step CLI or SQL script to fix the issue. It does not execute the script.
The Human Approval Loop: The Planner Agent posts the proposed script to a dedicated Slack channel or PagerDuty incident. A human engineer reviews the exact commands, verifies the blast radius, and clicks “Approve.”
The Executor Automation: Once approved, a deterministic CI/CD pipeline or automation runner (not an LLM) executes the script against the infrastructure and reports the result back to the chat.
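
The stages above can be sketched as a small typed pipeline. This is a minimal illustration under stated assumptions, not a reference implementation: every name (`Alert`, `Hypothesis`, `RemediationPlan`, `run_pipeline`) is hypothetical, and the agents are passed in as plain callables rather than real LLM or MCP clients.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Alert:
    slo: str        # which SLO was violated, e.g. "p99_latency"
    resource: str   # affected resource identifier


@dataclass(frozen=True)
class Hypothesis:
    summary: str
    evidence: Tuple[str, ...]   # read-only observations only


@dataclass(frozen=True)
class RemediationPlan:
    runbook_id: str
    script: str     # the exact commands shown to the human reviewer


def run_pipeline(
    alert: Alert,
    diagnose: Callable[[Alert], Hypothesis],        # read-only Diagnosis Agent
    plan: Callable[[Hypothesis], RemediationPlan],  # Planner: drafts, never executes
    approve: Callable[[RemediationPlan], bool],     # Human Approval Loop
    execute: Callable[[RemediationPlan], str],      # deterministic Executor Automation
) -> Optional[str]:
    """Run one incident through the stages; return None if the human rejects."""
    hypothesis = diagnose(alert)
    proposal = plan(hypothesis)
    if not approve(proposal):
        return None  # nothing is executed; the human takes manual control
    return execute(proposal)
```

The point of the shape is that `execute` is only reachable through `approve`, and neither agent callable ever receives the executor.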

In Practice

The documented pattern for safe autonomous operations relies on separating diagnostic authority from execution authority and routing every change through an explicit approval workflow.

Context: AWS has published architecture guidance on human-in-the-loop patterns for autonomous agents in the Amazon Bedrock documentation, specifically recommending that agents performing potentially destructive operations route through an approval workflow rather than executing directly — to preserve the change management controls required by compliance frameworks (Amazon Bedrock: human in the loop).

Action: The documented architectural principle for safe agentic operations is that agents should never hold both diagnostic and execution authority in the same process. A read-only Diagnosis Agent and a write-enabled Executor are two separate components with separate IAM roles — the data gathered by the Diagnosis Agent passes through a human approval step before the Executor ever receives an execution credential.
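
As a rough sketch of how the Executor might receive authority only after approval, the example below issues a short-lived grant bound to the exact script the human reviewed. Binding the grant to a SHA-256 fingerprint and using a five-minute lifetime are assumptions made for illustration; `ExecutionGrant`, `issue_grant`, and `executor_accepts` are hypothetical names, not part of any AWS or MCP API.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionGrant:
    script_sha256: str   # fingerprint of the approved script
    approver: str        # who clicked "Approve"
    expires_at: float    # Unix timestamp after which the grant is void


def fingerprint(script: str) -> str:
    return hashlib.sha256(script.encode("utf-8")).hexdigest()


def issue_grant(script: str, approver: str, now: float,
                ttl_seconds: float = 300.0) -> ExecutionGrant:
    """Called by the approval step only, never by an agent."""
    return ExecutionGrant(fingerprint(script), approver, now + ttl_seconds)


def executor_accepts(grant: ExecutionGrant, script: str, now: float) -> bool:
    """The Executor runs a script only if it matches an unexpired grant."""
    return now < grant.expires_at and grant.script_sha256 == fingerprint(script)
```

Because the grant covers one script fingerprint, a script altered after review no longer matches and is refused.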

Result: This separation enforces that the human engineer’s role becomes approval-based rather than command-based: during an incident, the engineer’s job shifts from typing SQL commands to evaluating whether the agent’s proposed script matches the blast-radius description provided by the Diagnosis Agent.

Learning: Open Policy Agent (OPA) or a similar policy engine can automate the first-pass script validation — rejecting anything containing DROP, TRUNCATE, or cross-account resource modifications — leaving the human to arbitrate edge cases, not obvious rejections. The human approval gate is not a workaround for agent limitations; it is the safety boundary that makes autonomous SRE deployable in regulated environments.

Decision Tree

When architecting the control flow for an autonomous incident response, enforce strict boundaries at every transition.

flowchart TD
    A[Deterministic Alert Fires] --> B[Diagnosis Agent Initiated]
    B --> C[Agent Calls Read-Only MCP Tools]
    C --> D[Agent Generates Hypothesis]
    D --> E[Remediation Planner Agent Initiated]
    E --> F[Planner Maps Hypothesis to Approved Runbook]
    F --> G[Planner Generates Exact Execution Script]
    G --> H[Human Approval Gate]
    H --> H1{Human Approves?}
    H1 -->|No| I[Human Takes Manual Control]
    H1 -->|Yes| J[Deterministic Automation Executes Script]
    J --> K[Verify Recovery via Telemetry]
    K --> K1{Is System Healthy?}
    K1 -->|Yes| L[Generate Post-Mortem]
    K1 -->|No| I
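
The flowchart can also be enforced in code as an explicit transition table, so that no component can skip the approval gate. This is a minimal sketch: the `Stage` names and the `advance` helper are hypothetical, and several flowchart nodes (tool calls, hypothesis, runbook mapping, script generation) are collapsed into the DIAGNOSIS and PLANNING stages for brevity.

```python
from enum import Enum, auto


class Stage(Enum):
    ALERT = auto()
    DIAGNOSIS = auto()
    PLANNING = auto()
    APPROVAL = auto()
    EXECUTION = auto()
    VERIFICATION = auto()
    POST_MORTEM = auto()
    MANUAL_CONTROL = auto()


# Allowed transitions, mirroring the edges of the flowchart.
TRANSITIONS = {
    Stage.ALERT: {Stage.DIAGNOSIS},
    Stage.DIAGNOSIS: {Stage.PLANNING},
    Stage.PLANNING: {Stage.APPROVAL},
    Stage.APPROVAL: {Stage.EXECUTION, Stage.MANUAL_CONTROL},
    Stage.EXECUTION: {Stage.VERIFICATION},
    Stage.VERIFICATION: {Stage.POST_MORTEM, Stage.MANUAL_CONTROL},
    Stage.POST_MORTEM: set(),      # terminal
    Stage.MANUAL_CONTROL: set(),   # terminal: a human owns the incident
}


def advance(current: Stage, target: Stage) -> Stage:
    """Move to the target stage, refusing any edge not in the flowchart."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```

For example, `advance(Stage.PLANNING, Stage.EXECUTION)` raises `ValueError`, because the only edge out of planning leads to the approval gate.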

Remediation Options

Supervised Execution (Medium Speed, Lowest Risk): The architecture strictly enforces the Human Approval Gate. The agents only draft the plan; nothing runs until a human approves it, and the deterministic automation then executes the approved script.
- Tradeoff: MTTR (Mean Time to Resolve) is bottlenecked by the human’s ability to wake up, read the Slack message, and click approve.
Auto-Approve for Known Runbooks (Fast, Medium Risk): If the Remediation Planner maps the issue to an explicitly whitelisted runbook (e.g., “Add 10% disk capacity to volume”), the system skips the Human Approval Gate and executes it immediately, simply notifying the human after the fact.
- Tradeoff: Requires absolute trust in the Diagnosis Agent’s ability to correctly classify the failure. If the agent misclassifies an application bug as a disk space issue, it will waste money scaling disks unnecessarily.
Complete Autonomy (Extremely Fast, Catastrophic Risk): The agent writes dynamic scripts on the fly and executes them against the database without mapping to pre-approved runbooks or seeking human approval.
- Tradeoff: Unacceptable for production database environments. This pattern violates every principle of SRE change management and auditability.
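
The first two options can be expressed as a single routing decision, with the third deliberately left unrepresentable. The sketch below is illustrative: the allowlist contents and the names `ApprovalMode`, `AUTO_APPROVE_RUNBOOKS`, and `approval_mode` are hypothetical.

```python
from enum import Enum
from typing import Optional


class ApprovalMode(Enum):
    HUMAN_REQUIRED = "human_required"   # Supervised Execution
    AUTO_APPROVE = "auto_approve"       # Auto-Approve for Known Runbooks


# Hypothetical allowlist of runbook IDs considered safe to run unattended.
AUTO_APPROVE_RUNBOOKS = frozenset({"add-disk-capacity-10pct"})


def approval_mode(runbook_id: Optional[str]) -> ApprovalMode:
    """Decide how a plan is approved.

    A plan that maps to no runbook (runbook_id is None) is a dynamic script,
    so it always goes to a human. There is no "complete autonomy" branch.
    """
    if runbook_id is not None and runbook_id in AUTO_APPROVE_RUNBOOKS:
        return ApprovalMode.AUTO_APPROVE
    return ApprovalMode.HUMAN_REQUIRED
```

Keeping the allowlist as data rather than agent output means expanding it is itself a reviewed change.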

Rollback Plan

The defining feature of a mature Agentic SRE architecture is that the agent is never allowed to define the rollback plan. The deterministic CI/CD pipeline that executes the agent’s script must already know how to revert the state. For example, if the agent modifies a Terraform variable to increase an instance size, the pipeline runs git revert on that commit when the post-deployment health checks fail. Never ask an LLM to fix a production outage that the LLM itself just caused.
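
A minimal sketch of that pipeline behavior follows. The three callables are assumptions standing in for real pipeline steps (for instance, applying a commit, reverting it, and polling health checks); `apply_with_rollback` is a hypothetical name. Note that the revert step is supplied by the pipeline, not by the agent.

```python
from typing import Callable


def apply_with_rollback(
    apply_change: Callable[[], None],    # runs the approved change
    revert_change: Callable[[], None],   # pipeline-owned revert, never agent-written
    is_healthy: Callable[[], bool],      # post-deployment health check
) -> str:
    """Apply a change and revert it if it fails or the system is unhealthy."""
    try:
        apply_change()
        healthy = is_healthy()
    except Exception:
        # A failure while applying or checking is treated as unhealthy.
        healthy = False
    if healthy:
        return "applied"
    revert_change()
    return "reverted"
```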

Automation Opportunity

Automate the guardrails, not just the actions. Build a “Policy Engine” (like Open Policy Agent) that intercepts the execution scripts drafted by the Remediation Planner. If the script contains forbidden keywords (DROP, TRUNCATE, DELETE) or attempts to modify resources outside the explicit scope of the current incident, the Policy Engine hard-rejects the plan before the Human Approval phase is even reached.
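
The check described above might look like the following. This is a Python stand-in for illustration only: a real Open Policy Agent deployment would express these rules in Rego, and `DraftPlan` and `policy_violations` are hypothetical names. Keyword matching is deliberately crude and will also flag the words inside comments or string literals.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, List

# Forbidden keywords from the text, matched as whole words, case-insensitively.
FORBIDDEN = re.compile(r"\b(DROP|TRUNCATE|DELETE)\b", re.IGNORECASE)


@dataclass(frozen=True)
class DraftPlan:
    script: str
    targets: FrozenSet[str]   # resources the script would modify


def policy_violations(plan: DraftPlan, incident_scope: FrozenSet[str]) -> List[str]:
    """Return every reason to hard-reject the plan; an empty list means pass."""
    violations = []
    for match in FORBIDDEN.finditer(plan.script):
        violations.append(f"forbidden keyword: {match.group(1).upper()}")
    for resource in sorted(plan.targets - incident_scope):
        violations.append(f"out-of-scope resource: {resource}")
    return violations
```

A plan reaches the Human Approval phase only when the returned list is empty; returning the reasons rather than a bare boolean also gives the audit trail something to record.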

Leadership Summary

Agents are Planners, Pipelines are Executors: Never give an LLM an API key with write access to AWS or your database. Give the LLM the ability to write a script, and make a deterministic pipeline execute it.
Specialization Beats Generalization: A team of five agents (Diagnosis, Cost, Security, Remediation, Reviewer) challenging each other’s conclusions over MCP tends to produce a safer outcome than one massive agent trying to do it all.
The Human Becomes the Approver: The future of database engineering is not typing SQL queries during an outage. It is reviewing the SQL queries generated by your AI counterparts and clicking “Approve.”

What to Do Next

Problem: A single “god agent” with write access to all infrastructure creates an incident response architecture where the agent can compound the original failure — a hallucinated argument or misclassified failure mode makes the outage dramatically worse with no human checkpoint.
Solution: Separate the incident lifecycle into specialist roles with hard privilege boundaries: read-only Diagnosis Agent (never writes), Remediation Planner (generates but never executes), deterministic automation runner (executes only human-approved scripts from a pre-defined runbook schema).
Proof: Take your most common recurring incident, build a pipeline where the Diagnosis Agent detects the issue and drafts the exact fix — if the human approval review takes more than 5 minutes, the Planner’s output isn’t specific enough and the runbook schema needs tightening.
Action: Map your three most common recurring database incidents into machine-readable JSON runbook schemas this week. Agents can only plan against schemas, not PDF documents, and this is the prerequisite for deploying any autonomous SRE capability in production.
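
To make "machine-readable runbook" concrete, here is one possible shape and a loader that rejects incomplete entries. The field names, the example values, and `load_runbook` are all assumptions for illustration; the article does not define a schema, and a real deployment would likely use a dedicated JSON Schema validator.

```python
import json

# Hypothetical runbook for the disk-capacity example.
RUNBOOK_JSON = """
{
  "id": "add-disk-capacity-10pct",
  "trigger": "disk_usage_above_threshold",
  "parameters": {"volume_id": "string", "increase_percent": "integer"},
  "steps": ["resize_volume", "verify_free_space"],
  "rollback": "restore_previous_volume_size",
  "requires_human_approval": true
}
"""

# Required fields and the Python type each must decode to.
REQUIRED_FIELDS = {
    "id": str,
    "trigger": str,
    "parameters": dict,
    "steps": list,
    "rollback": str,
    "requires_human_approval": bool,
}


def load_runbook(raw: str) -> dict:
    """Parse a runbook and fail loudly if it is not fully specified."""
    runbook = json.loads(raw)
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in runbook:
            raise ValueError(f"missing field: {field}")
        if not isinstance(runbook[field], expected_type):
            raise ValueError(f"field {field!r} must be {expected_type.__name__}")
    if not runbook["steps"]:
        raise ValueError("runbook must define at least one step")
    return runbook
```

A Planner constrained to emit only a runbook `id` plus values for its declared `parameters` gives the human reviewer a much smaller surface to check than a free-form script.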

Rajiv
