Evals Are the New Unit Tests for Agents
Content reflects the state as of January 2026. AI tooling and model capabilities in this area change frequently.
An agent that cannot be evaluated is not automation; it is an expensive suggestion engine.
Situation
Database and cloud teams already trust tests more than explanations. A backup job is not healthy because it prints success; it is healthy because restore validation passes. A migration is not safe because the pull request sounds careful; it is safe because the lock behavior, rollback path, and application checks are known before production.
This habit of trusting checks over explanations matters for database, cloud, and platform teams because agents do not operate in a vacuum. They inherit repository rules, tool permissions, deployment workflows, incident history, and the quality of the evidence available to them.
| Operating layer | Default approach | Better alternative |
|---|---|---|
| Context | Rely on a long prompt or chat history | Give the agent task-specific evidence and rules |
| Tooling | Expose broad tools and inspect later | Expose narrow tools with clear approval boundaries |
| Verification | Read the final answer | Check the artifact, trace, and final state |
The Problem
Agent work needs the same evidence-based verification as backups and migrations, but most teams still review natural-language transcripts. That works while the agent drafts prose. It fails when the agent triages incidents, writes SQL, edits Terraform, or proposes cloud changes. The transcript can be plausible while the final state is wrong.
The practical question is not whether an agent can produce a convincing response. The question is whether the engineering system around that response makes the work observable, reversible, and reviewable.
| Failure point | What breaks | Why it matters |
|---|---|---|
| Weak boundary | Agent authority is broader than the task | A diagnostic run can become an unsafe change |
| Missing evidence | The agent cannot cite the state it used | Review becomes opinion instead of verification |
| No lifecycle | The workflow ends at a message | Ownership, audit, cleanup, and rollback disappear |
Agent Eval Harness
For DB teams, the eval unit should be a production-shaped task: diagnose blocking, classify replication lag, identify the bad query plan, reject an unsafe migration, or prove a backup restore works.
flowchart TD
A[task request — bounded intent] --> B[agent eval harness — controls]
B --> C[tool execution — evidence collected]
C --> D[verification — final state checked]
D --> E[human handoff — audit retained]
- Define the operating boundary. Write down the task class, allowed tools, environment, data class, and approval mode before the agent runs.
- Shape the evidence. Return compact observations instead of raw dumps. The agent should see enough to reason, but not so much that context is wasted.
- Require proof of completion. Completion should be an artifact or state check: a passing test, a reviewed plan, a valid rollback, a trace, or a linked ticket.
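The boundary and proof-of-completion steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `OperatingBoundary`, `ToolCall`, `check_tool_call`, `proof_of_completion`, the tool names, and the approval modes are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingBoundary:
    """Hypothetical record of what one agent run is allowed to do."""
    task_class: str                # e.g. "diagnose_blocking"
    allowed_tools: frozenset[str]  # tool names the agent may call
    environment: str               # e.g. "staging"
    data_class: str                # e.g. "no_pii"
    approval_mode: str             # assumed values: "read_only" or "human_approval"


@dataclass(frozen=True)
class ToolCall:
    tool: str
    mutates_state: bool


def check_tool_call(boundary: OperatingBoundary, call: ToolCall) -> str:
    """Return "allow", "needs_approval", or "deny" for a proposed tool call."""
    if call.tool not in boundary.allowed_tools:
        return "deny"
    if call.mutates_state:
        # In this sketch a mutating call is never allowed silently.
        if boundary.approval_mode == "human_approval":
            return "needs_approval"
        return "deny"
    return "allow"


def proof_of_completion(artifacts: dict[str, bool], required: set[str]) -> bool:
    """Completion is a state check: every required artifact must exist and pass."""
    return all(artifacts.get(name, False) for name in required)


# Example boundary for a read-only diagnostic task (all values are illustrative).
boundary = OperatingBoundary(
    task_class="diagnose_blocking",
    allowed_tools=frozenset({"list_blocking_sessions", "explain_query"}),
    environment="staging",
    data_class="no_pii",
    approval_mode="read_only",
)
```

Writing the boundary as data rather than as prompt text means the harness, not the model, decides whether a call goes through, and the same record can be attached to the trace for review.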
In Practice
Context: Anthropic describes agent evals as harnesses that run tasks, collect the model’s steps, grade the result, and aggregate performance. The important shift is from judging a single answer to measuring repeatable task outcomes. Source: Anthropic, "Demystifying evals for AI agents".
Action: Create a golden set of historical incidents and expected outcomes. Each eval should contain the starting evidence, the allowed tools, the expected final state, and the grader that decides pass or fail.
Result: The same incident can be replayed across model versions, prompt changes, and tool changes. If the agent regresses, the harness shows which task class broke.
Learning: The eval unit that pays off for DB teams is the production-shaped task, graded on final state rather than on how the answer reads. This is a documented pattern or a direct consequence of how the named systems behave, not a fabricated production story.
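One way to picture the golden set and the replay is the sketch below. It assumes an agent can be modeled as a function from starting evidence and allowed tools to a final state; `EvalCase`, `run_suite`, `regressions`, the two stub agents, and the evidence fields are hypothetical and stand in for whatever harness a team actually uses.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable

# Assumed shapes: the agent maps (evidence, allowed_tools) to a final state.
FinalState = dict[str, str]
Agent = Callable[[dict[str, str], frozenset[str]], FinalState]


@dataclass(frozen=True)
class EvalCase:
    name: str
    task_class: str
    starting_evidence: dict[str, str]
    allowed_tools: frozenset[str]
    expected_final_state: FinalState

    def grade(self, actual: FinalState) -> bool:
        """Pass only if every expected key matches the actual final state."""
        return all(actual.get(k) == v for k, v in self.expected_final_state.items())


def run_suite(agent: Agent, cases: list[EvalCase]) -> dict[str, float]:
    """Replay every case and return the pass rate per task class."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for case in cases:
        actual = agent(case.starting_evidence, case.allowed_tools)
        total[case.task_class] += 1
        passed[case.task_class] += int(case.grade(actual))
    return {task: passed[task] / total[task] for task in total}


def regressions(baseline: dict[str, float], candidate: dict[str, float]) -> list[str]:
    """Task classes whose pass rate dropped between two runs."""
    return sorted(t for t in baseline if candidate.get(t, 0.0) < baseline[t])


# Illustrative golden set; real cases would come from resolved incidents.
GOLDEN_SET = [
    EvalCase(
        name="lag_from_long_transaction",
        task_class="classify_replication_lag",
        starting_evidence={"replica_lag_seconds": "840", "oldest_txn_age_seconds": "900"},
        allowed_tools=frozenset({"read_replica_status"}),
        expected_final_state={"classification": "long_running_transaction"},
    ),
    EvalCase(
        name="migration_without_rollback",
        task_class="review_migration",
        starting_evidence={"migration_summary": "rewrites a large table, no rollback path"},
        allowed_tools=frozenset({"read_migration_plan"}),
        expected_final_state={"decision": "reject"},
    ),
]


def baseline_agent(evidence: dict[str, str], tools: frozenset[str]) -> FinalState:
    """Stub standing in for the current agent version."""
    if "replica_lag_seconds" in evidence:
        return {"classification": "long_running_transaction"}
    return {"decision": "reject"}


def candidate_agent(evidence: dict[str, str], tools: frozenset[str]) -> FinalState:
    """Stub standing in for a new version that wrongly approves the migration."""
    if "replica_lag_seconds" in evidence:
        return {"classification": "long_running_transaction"}
    return {"decision": "approve"}
```

Running `regressions(run_suite(baseline_agent, GOLDEN_SET), run_suite(candidate_agent, GOLDEN_SET))` names `review_migration` as the task class that broke, which is the signal the replay is meant to produce. The exact-match grader here is deliberately simple; graders for real incidents may need tolerance ranges or state queries.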
Where It Breaks
| Failure mode | Trigger | Fix |
|---|---|---|
| Transcript grading | Reviewer asks whether the answer sounded right | Grade final state, not prose |
| Tiny eval set | Only three happy-path tasks are tested | Use incident-shaped cases across failure classes |
| Leaky tools | Eval has tools unavailable in production | Match eval permissions to real deployment modes |
| No negative cases | Agent never sees unsafe migrations or ambiguous alerts | Add reject and escalate cases |
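Three of the failure modes in the table can be caught mechanically before any agent runs. The sketch below lints an eval set for too few failure classes, tools that production does not expose, and missing reject or escalate cases. `CaseSummary`, `lint_eval_set`, and the threshold of three failure classes are assumptions made for illustration; transcript grading is a review habit and cannot be linted this way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseSummary:
    """Hypothetical summary of one eval case, enough to lint the set."""
    name: str
    failure_class: str
    allowed_tools: frozenset[str]
    expected_decision: str  # assumed values: "act", "reject", or "escalate"


def lint_eval_set(
    cases: list[CaseSummary],
    production_tools: frozenset[str],
    min_failure_classes: int = 3,  # arbitrary threshold for this sketch
) -> list[str]:
    """Return a list of human-readable problems; an empty list means the set passes."""
    problems: list[str] = []

    # Tiny eval set: too few distinct failure classes are covered.
    classes = {c.failure_class for c in cases}
    if len(classes) < min_failure_classes:
        problems.append(
            f"only {len(classes)} failure class(es) covered, want at least {min_failure_classes}"
        )

    # Leaky tools: an eval grants tools the production deployment does not have.
    for c in cases:
        extra = c.allowed_tools - production_tools
        if extra:
            problems.append(f"{c.name}: tools not available in production: {sorted(extra)}")

    # No negative cases: nothing expects the agent to refuse or hand off.
    decisions = {c.expected_decision for c in cases}
    for needed in ("reject", "escalate"):
        if needed not in decisions:
            problems.append(f"no case expects the agent to {needed}")

    return problems
```

A check like this can run in CI next to the eval suite itself, so a shrinking or drifting golden set is flagged the same way a failing test would be.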
What to Do Next
- Problem: Most teams still review natural-language transcripts, and a transcript can be plausible while the final state is wrong.
- Solution: Make the eval unit a production-shaped task, such as diagnosing blocking or rejecting an unsafe migration, and grade its final state.
- Proof: Replaying the same incident across model versions, prompt changes, and tool changes shows which task class regressed.
- Action: Take five resolved database incidents and turn each into an eval with input evidence, allowed tools, expected outcome, and a pass or fail grader.
The teams that get value from agents will not be the teams with the longest prompts. They will be the teams that turn agent work into a controlled engineering workflow.