AI Agent Observability: Monitor Tool Calls, Token Spend, Latency, and Failure Loops

If you give an AI agent access to production databases without monitoring its tool calls, context growth, and token spend, you are not building an SRE automation platform—you are building an autonomous denial-of-service engine.

Situation

Over the past two years, the observability landscape has shifted dramatically. In 2024, the priority was establishing a baseline of deterministic metrics: CPU saturation, query latency, connection pool utilization, and replication lag. In 2025, the industry moved to AI-assisted operations, using generative AI to correlate static alarms with log streams and deployment events to reduce human alert fatigue.

In 2026, the paradigm has shifted again. Engineering teams are no longer just using AI to read dashboards; they are deploying autonomous SRE agents that act on the infrastructure. These agents possess read/write access to production environments via secure toolchains. They can spin up read replicas, terminate blocking queries, and modify auto-scaling group parameters.

However, this autonomy introduces entirely new failure domains. An autonomous agent does not fail by crashing like a traditional microservice. It fails by hallucinating parameters, getting stuck in recursive retry loops, exhausting its context window, or burning through API token budgets at astronomical speeds. CloudWatch and Datadog have evolved to provide built-in generative AI observability, but platform engineers must understand how to architect these monitors. Monitoring an agent is fundamentally different than monitoring an application.

The Problem

Traditional observability relies on the predictability of code execution. A Python script executing a database query will do the exact same thing every time it runs. If it fails, it throws a deterministic exception, logs a stack trace, and exits.

Agents are non-deterministic. Driven by Large Language Models (LLMs), an agent decides its execution path at runtime based on the prompt, the context, and the output of its previous actions.

This non-determinism creates several novel failure modes that cannot be caught by a standard APM trace:

The Recursive Retry Loop: An agent executes a database query that returns a syntax error. Instead of failing, the agent attempts to fix the syntax and retries. If the agent’s logic is flawed, it may rewrite and retry the query 500 times in a matter of minutes, driving up database CPU and consuming massive token budgets.
Context Window Saturation: An agent is tasked with analyzing database logs. It executes a read_logs tool that returns 100,000 lines of raw text. The agent’s context window fills up, causing it to “forget” its original instructions, leading to unpredictable, erratic tool calls.
Tool Hallucination: An agent needs to scale a database instance. It hallucinates a tool name (scale_rds_cluster) that does not exist, or it calls a valid tool (execute_sql) with hallucinated arguments (a table name that doesn’t exist).
The Latency Trap: Human operators expect API calls to return in milliseconds. An LLM might take 15 seconds to generate the tokens for a complex reasoning step. If the agent is orchestrating a time-sensitive failover, this latency can lead to cascading timeouts in the downstream systems waiting for the agent’s decision.

AI Agent Observability Architecture

To safely operate an SRE agent, you must construct an observability pipeline specifically designed for LLM telemetry. Every action the agent takes must be captured, parsed, and evaluated in real-time.

The Five Pillars of Agent Telemetry

Model Invocation Metrics: Track the specific model version (e.g., claude-3-5-sonnet-20241022), the input tokens, the output tokens, and the raw inference latency.
Tool Execution Traces: Log the exact name of the tool called, the JSON arguments provided by the model, the execution time of the tool itself, and the raw string returned to the model.
Context Growth Tracking: Monitor the total size of the conversation array (in tokens) as it grows. Alert when the context approaches 80% of the model’s maximum window.
Loop Detection States: Track the number of consecutive identical tool calls or the number of sequential errors encountered without a successful action.
Cost Attribution: Calculate the real-time financial cost of the agent’s session based on token usage and associate it with an incident ID or team budget.

In Practice

The documented pattern for surviving agent deployments at scale involves treating the agent as a highly privileged, easily confused human operator.

Context: Anthropic’s documentation on Claude’s tool use describes how a model can enter a retry loop when a tool returns an error — the model will attempt to reformulate the tool call based on the error response, which can produce many sequential calls if the underlying failure is not transient (Anthropic tool use docs). Without an external loop-detection mechanism, this behavior is by design: the model has no native “give up after N retries” instruction that reliably survives context pressure.

Action: The documented mitigation is to instrument tool execution at the application layer using OpenTelemetry spans that track consecutive error counts independently of the LLM. The counter must be deterministic code in the agent harness, not a prompt instruction, because the LLM’s self-awareness of its own error rate degrades as the context window fills with error messages.

Result: A hard token budget limit enforced at the LLM client wrapper layer — not inside the prompt — is the only reliable mechanism to prevent runaway cost from recursive retry loops. AgentConsecutiveErrors is a custom metric that the agent orchestration code must publish explicitly; no cloud provider exposes this natively because it is a semantic signal about agent behavior, not a standard infrastructure metric.

Learning: The minimum viable kill switch for any production agent deployment is: (1) a custom metric tracking consecutive tool failures, (2) an alarm at threshold 3, and (3) a handler that suspends the agent process, revokes its execution credentials, and pages a human with the full session transcript.

Decision Tree

When building telemetry for an autonomous agent, use this logic to design your monitoring strategy:

flowchart TD
    A[Agent Session Starts] --> B[Log Initial Prompt & Context]
    B --> C[Agent Generates Action]
    C --> D{Is it a Tool Call?}
    D -->|Yes| E[Trace Tool Name & Arguments]
    E --> F[Execute Tool]
    F --> G{Did the Tool Error?}
    G -->|Yes| H[Increment Error Counter]
    H --> H1{Error Count > Threshold?}
    H1 -->|Yes| I[Suspend Agent & Page Human]
    H1 -->|No| J[Append Error to Context, Retry LLM]
    G -->|No| K[Reset Error Counter, Append Result to Context]
    K --> L{Is Context > 80% Capacity?}
    L -->|Yes| M[Trigger Context Summarization Routine]
    L -->|No| N[Continue Session]
    D -->|No| O[Agent Provides Final Answer]

Remediation Options

Implement Hard Token Limits (Fast, Low Risk): Configure your LLM client wrapper to hard-stop execution if a single agent session exceeds a predefined token budget (e.g., 100,000 tokens).
- Tradeoff: The agent will abruptly fail in the middle of complex incidents, requiring human intervention. However, it prevents runaway cost spirals.
Deploy Context Summarization (Medium Speed, High Value): When the agent’s context window reaches 70% capacity, automatically inject a system prompt that forces the agent to summarize its findings so far, clear the raw execution history, and continue with only the summary.
- Tradeoff: The agent loses access to the granular raw data of its early steps, which might cause it to repeat an action it already tried.
Enforce Schema Validation on Tool Calls (High Impact, High Effort): Before passing a hallucinated tool argument to your infrastructure, intercept the JSON payload and validate it against a strict JSON Schema. If it fails, do not execute the tool; return a schema validation error directly to the agent.
- Tradeoff: Requires maintaining explicit schemas for every operational tool, which slows down the addition of new capabilities.

Rollback Plan

If an agent exhibits rogue behavior—such as continuously modifying auto-scaling groups or dropping legitimate connections—the rollback mechanism must bypass the agent entirely. Every agent architecture must include a “Kill Switch” API. Invoking the kill switch immediately revokes the IAM role assumed by the agent’s worker environment, severing its access to the infrastructure. The human engineer then assumes control using standard operational runbooks.

Automation Opportunity

Build an “Agent Supervisor” process. This is a lightweight, deterministic script (not an LLM) that tails the agent’s telemetry stream in real-time. If the supervisor detects that the agent has spent more than $5 in API calls without successfully resolving the incident, or if the agent has called the same read-only tool five times in a row, the supervisor automatically terminates the agent process, reverts any infrastructure modifications the agent made during the session, and escalates the ticket to a human SRE.

Leadership Summary

Agents are Not Software, They are Employees: You would not give a junior engineer root access to a database and walk away. You would monitor their commands, review their logs, and cap their spending. Treat AI agents with the exact same skepticism.
Cost is an Engineering Metric: With LLMs, compute cost is directly tied to the length of the incident. A long, struggling agent session is not just slow; it is financially expensive.
Observability Must be Deterministic: Do not use an AI to monitor your AI. The supervisor systems that detect infinite loops and token exhaustion must be rigid, deterministic code that relies on explicit thresholds.

What to Do Next

Problem: An AI agent with write access to production infrastructure and no loop detection, token budget limit, or kill switch is an autonomous denial-of-service engine — a recursive retry loop can exhaust database capacity and API token budgets before any human intervenes.
Solution: Treat every agent session as a billable, privilege-bearing process: emit OpenTelemetry spans for every tool call with execution latency and argument hashes, implement a deterministic supervisor that suspends the agent on consecutive failures (the supervisor must be code, not a prompt), and enforce hard token budget limits with automatic human escalation.
Proof: Run a game day providing the agent a tool that always returns 500. Verify loop-detection fires within three retries and a human is paged with the full session transcript — if loop detection doesn’t fire, the agent will retry until the token budget is gone.
Action: Add a custom metric that increments on each agent tool-call failure, set an alarm at threshold 3 for consecutive failures, and wire it to suspend the agent and page on-call — this is the minimum viable kill switch for any production agent deployment.

Situation

The Problem

AI Agent Observability Architecture

The Five Pillars of Agent Telemetry

In Practice

Decision Tree

Remediation Options

Rollback Plan

Automation Opportunity

Leadership Summary

What to Do Next

Rajiv

Related Posts

Agentic SRE Architecture: Skills, Agents, MCP Servers, and Human Approval Loops

MCP Server Observability: The New Control Plane for AI + Enterprise Tools

Telemetry Cost Control: Why Observability Data Itself Needs Governance