Evals Are the New Unit Tests for Agents

An agent that cannot be evaluated is not automation; it is an expensive suggestion engine.

Situation

Database and cloud teams already trust tests more than explanations. A backup job is not healthy because it prints success; it is healthy because restore validation passes. A migration is not safe because the pull request sounds careful; it is safe because the lock behavior, rollback path, and application checks are known before production.

The pattern matters for database, cloud, and platform teams because agents do not operate in a vacuum. They inherit repository rules, tool permissions, deployment workflows, incident history, and the quality of the evidence available to them.

Operating layer	Default approach	Better alternative
Context	Rely on a long prompt or chat history	Give the agent task-specific evidence and rules
Tooling	Expose broad tools and inspect later	Expose narrow tools with clear approval boundaries
Verification	Read the final answer	Check the artifact, trace, and final state

The Problem

Agent work needs the same standard of proof, but most teams still review natural-language transcripts. That works while the agent drafts prose. It fails when the agent triages incidents, writes SQL, edits Terraform, or proposes cloud changes, because the transcript can be plausible while the final state is wrong.

The practical question is not whether an agent can produce a convincing response. The question is whether the engineering system around that response makes the work observable, reversible, and reviewable.
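The gap between a plausible transcript and a wrong final state can be made concrete with a small Python sketch. Everything named here (`RunResult`, `grade_final_state`, the `index_exists` and `rollback_tested` keys) is hypothetical and not taken from any particular tool; the only point is that the grader never reads the agent's prose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """What one agent run left behind (all fields are illustrative)."""
    transcript: str    # the agent's natural-language summary
    final_state: dict  # facts observed after the run, e.g. from a sandbox


def grade_final_state(result: RunResult, expected: dict) -> list[str]:
    """Return a list of mismatches; an empty list means the run passes.

    The transcript is deliberately ignored: only observed state counts.
    """
    mismatches = []
    for key, want in expected.items():
        got = result.final_state.get(key)
        if got != want:
            mismatches.append(f"{key}: expected {want!r}, observed {got!r}")
    return mismatches


# A run whose summary sounds careful but whose state is incomplete.
run = RunResult(
    transcript="Index created concurrently; no locks taken; rollback verified.",
    final_state={"index_exists": True, "rollback_tested": False},
)
expected = {"index_exists": True, "rollback_tested": True}

problems = grade_final_state(run, expected)
print("PASS" if not problems else "FAIL: " + "; ".join(problems))
```

A reviewer reading only the transcript would likely approve this run. The state check fails it, and says why.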

Failure point	What breaks	Why it matters
Weak boundary	Agent authority is broader than the task	A diagnostic run can become an unsafe change
Missing evidence	The agent cannot cite the state it used	Review becomes opinion instead of verification
No lifecycle	The workflow ends at a message	Ownership, audit, cleanup, and rollback disappear

Agent Eval Harness

For DB teams, the eval unit should be a production-shaped task: diagnose blocking, classify replication lag, identify the bad query plan, reject an unsafe migration, or prove a backup restore works.

flowchart TD
    A[task request — bounded intent] --> B[agent eval harness — controls]
    B --> C[tool execution — evidence collected]
    C --> D[verification — final state checked]
    D --> E[human handoff — audit retained]

1. Define the operating boundary. Write down the task class, allowed tools, environment, data class, and approval mode before the agent runs.
2. Shape the evidence. Return compact observations instead of raw dumps. The agent should see enough to reason, but not so much that context is wasted.
3. Require proof of completion. Completion should be an artifact or state check: a passing test, a reviewed plan, a valid rollback, a trace, or a linked ticket.
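The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: `OperatingBoundary`, `authorize`, `compact`, and `is_complete` are hypothetical names, and the tool names and field values are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingBoundary:
    """Written down before the agent runs (field values are illustrative)."""
    task_class: str
    allowed_tools: frozenset
    environment: str
    data_class: str
    approval_mode: str  # e.g. "read_only" or "human_approval"


def authorize(boundary: OperatingBoundary, tool: str) -> bool:
    """Step 1: a tool call is allowed only if the boundary lists it."""
    return tool in boundary.allowed_tools


def compact(rows: list, keys: tuple, limit: int = 5) -> list:
    """Step 2: keep only the named fields and the first `limit` rows."""
    return [{k: row[k] for k in keys if k in row} for row in rows[:limit]]


def is_complete(artifacts: dict, required: tuple) -> bool:
    """Step 3: done means every required artifact is present and non-empty."""
    return all(artifacts.get(name) for name in required)


boundary = OperatingBoundary(
    task_class="diagnose_blocking",
    allowed_tools=frozenset({"read_lock_graph", "read_active_sessions"}),
    environment="staging",
    data_class="no_customer_data",
    approval_mode="read_only",
)

# Fifty raw session rows shrink to three rows of two fields each.
raw = [
    {"pid": i, "wait_event": "Lock", "query": "UPDATE orders SET status = 'x'"}
    for i in range(50)
]
evidence = compact(raw, keys=("pid", "wait_event"), limit=3)

print(authorize(boundary, "kill_session"))
print(evidence)
print(is_complete({"trace_id": "t-1", "ticket": ""}, ("trace_id", "ticket")))
```

Keeping the boundary as frozen data rather than prompt text means the harness can enforce it, and a reviewer can diff it.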

Create a golden set of historical incidents and expected outcomes. Each eval should contain the starting evidence, the allowed tools, the expected final state, and the grader that decides pass or fail.
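One possible shape for a golden-set entry is sketched below in Python. The four fields named in the paragraph above map directly onto the dataclass; everything else (`EvalCase`, `run_case`, `stub_agent`, the case IDs, and the evidence values) is an assumption made so the example runs offline. A real harness would replace `stub_agent` with an actual agent call.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical signatures: an agent maps (evidence, allowed tools) to a final
# state, and a grader compares the expected state with the observed one.
Agent = Callable[[dict, frozenset], dict]
Grader = Callable[[dict, dict], bool]


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    task_class: str
    starting_evidence: dict
    allowed_tools: frozenset
    expected_final_state: dict
    grader: Grader


def verdict_matches(expected: dict, observed: dict) -> bool:
    """Pass only if the agent reached the expected verdict."""
    return observed.get("verdict") == expected["verdict"]


def run_case(case: EvalCase, agent: Agent) -> bool:
    """Run one case and let its grader decide pass or fail."""
    observed = agent(case.starting_evidence, case.allowed_tools)
    return case.grader(case.expected_final_state, observed)


def stub_agent(evidence: dict, allowed_tools: frozenset) -> dict:
    """Stand-in for a real agent: rejects anything needing an exclusive lock."""
    if evidence.get("lock_mode") == "ACCESS EXCLUSIVE":
        return {"verdict": "reject"}
    return {"verdict": "approve"}


golden_set = [
    EvalCase(
        case_id="mig-001",
        task_class="unsafe_migration",
        starting_evidence={"lock_mode": "ACCESS EXCLUSIVE", "table_rows": 900_000_000},
        allowed_tools=frozenset({"read_schema", "explain_plan"}),
        expected_final_state={"verdict": "reject"},
        grader=verdict_matches,
    ),
    EvalCase(
        case_id="mig-002",
        task_class="safe_migration",
        starting_evidence={"lock_mode": "SHARE UPDATE EXCLUSIVE", "table_rows": 1_200},
        allowed_tools=frozenset({"read_schema", "explain_plan"}),
        expected_final_state={"verdict": "approve"},
        grader=verdict_matches,
    ),
]

results = {case.case_id: run_case(case, stub_agent) for case in golden_set}
print(results)
```

Attaching the grader to the case, rather than to the harness, lets different task classes be judged by different rules without changing the runner.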

In Practice

Context: Anthropic describes agent evals as harnesses that run tasks, collect the model’s steps, grade the result, and aggregate performance. The important shift is from judging a single answer to measuring repeatable task outcomes. Source: Anthropic, Demystifying evals for AI agents.

Action: Create a golden set of historical incidents and expected outcomes. Each eval should contain the starting evidence, the allowed tools, the expected final state, and the grader that decides pass or fail.

Result: The same incident can be replayed across model versions, prompt changes, and tool changes. If the agent regresses, the harness shows which task class broke.
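A rough sketch of how a harness might surface the broken task class: aggregate pass rates per class for two replays of the same golden set and report only the classes that got worse. The function names, the `(task_class, passed)` tuple format, and the sample data are assumptions for illustration, not results from a real run.

```python
from collections import defaultdict


def pass_rate_by_class(results):
    """results: iterable of (task_class, passed) pairs -> {task_class: rate}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for task_class, passed in results:
        totals[task_class] += 1
        passes[task_class] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}


def regressions(baseline, candidate, tolerance=0.0):
    """Task classes whose pass rate dropped by more than `tolerance`.

    Returns {task_class: (baseline_rate, candidate_rate)}.
    """
    base = pass_rate_by_class(baseline)
    cand = pass_rate_by_class(candidate)
    return {
        c: (base[c], cand[c])
        for c in base
        if c in cand and cand[c] < base[c] - tolerance
    }


# Invented outcomes for the same four cases under two prompt versions.
baseline_run = [
    ("diagnose_blocking", True),
    ("diagnose_blocking", True),
    ("replication_lag", True),
    ("replication_lag", False),
]
candidate_run = [
    ("diagnose_blocking", True),
    ("diagnose_blocking", False),
    ("replication_lag", True),
    ("replication_lag", False),
]

print(regressions(baseline_run, candidate_run))
```

With only two cases per class, a single flipped result moves the rate by half, so in practice the tolerance and the number of cases per class both need to be chosen with that noise in mind.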

Learning: For DB teams, the eval unit should be a production-shaped task: diagnose blocking, classify replication lag, identify the bad query plan, reject an unsafe migration, or prove a backup restore works.

Where It Breaks

Failure mode	Trigger	Fix
Transcript grading	Reviewer asks whether the answer sounded right	Grade final state, not prose
Tiny eval set	Only three happy-path tasks are tested	Use incident-shaped cases across failure classes
Leaky tools	Eval has tools unavailable in production	Match eval permissions to real deployment modes
No negative cases	Agent never sees unsafe migrations or ambiguous alerts	Add reject and escalate cases
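Several of the failure modes in the table can be caught mechanically before any agent runs. The Python sketch below lints an eval suite for a tiny set, leaky tools, and missing negative cases. The case dictionary keys, the `lint_suite` name, and the `min_cases` default of 10 are all assumptions; the right threshold depends on how many failure classes a team actually sees.

```python
def lint_suite(cases, production_tools, min_cases=10):
    """Return a list of findings about the suite itself; empty means clean.

    cases: list of dicts with 'allowed_tools' (iterable of tool names) and
    'expected_verdict' (e.g. 'approve', 'reject', 'escalate').
    """
    findings = []

    # Tiny eval set: too few cases to cover more than the happy path.
    if len(cases) < min_cases:
        findings.append(f"tiny eval set: {len(cases)} cases, want >= {min_cases}")

    # Leaky tools: anything the eval grants that production does not.
    granted = {tool for case in cases for tool in case["allowed_tools"]}
    leaky = sorted(granted - set(production_tools))
    if leaky:
        findings.append("leaky tools: " + ", ".join(leaky))

    # No negative cases: the suite never expects a reject or an escalation.
    verdicts = {case["expected_verdict"] for case in cases}
    for needed in ("reject", "escalate"):
        if needed not in verdicts:
            findings.append(f"no '{needed}' cases")

    return findings


suite = [
    {"allowed_tools": ["read_schema"], "expected_verdict": "approve"},
    {"allowed_tools": ["read_schema", "drop_table"], "expected_verdict": "approve"},
    {"allowed_tools": ["explain_plan"], "expected_verdict": "reject"},
]

for finding in lint_suite(suite, production_tools={"read_schema", "explain_plan"}):
    print(finding)
```

Running a check like this in CI treats the eval suite as code that can itself regress, which fits the unit-test framing.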

What to Do Next

Problem: Most teams still review natural-language transcripts of agent work. That works while the agent drafts prose. It fails when the agent triages incidents, writes SQL, edits Terraform, or proposes cloud changes, because the transcript can be plausible while the final state is wrong.
Solution: For DB teams, the eval unit should be a production-shaped task: diagnose blocking, classify replication lag, identify the bad query plan, reject an unsafe migration, or prove a backup restore works.
Proof: The same incident can be replayed across model versions, prompt changes, and tool changes. If the agent regresses, the harness shows which task class broke.
Action: Take five resolved database incidents and turn each into an eval with input evidence, allowed tools, expected outcome, and a pass or fail grader.

The teams that get value from agents will not be the teams with the longest prompts. They will be the teams that turn agent work into a controlled engineering workflow.


Rajiv
