An agent that cannot be evaluated is not automation; it is an expensive suggestion engine.

Situation

Database and cloud teams already trust tests more than explanations. A backup job is not healthy because it prints success; it is healthy because restore validation passes. A migration is not safe because the pull request sounds careful; it is safe because the lock behavior, rollback path, and application checks are known before production.

The pattern matters for database, cloud, and platform teams because agents do not operate in a vacuum. They inherit repository rules, tool permissions, deployment workflows, incident history, and the quality of the evidence available to them.

| Operating layer | Default approach | Better alternative |
|---|---|---|
| Context | Rely on a long prompt or chat history | Give the agent task-specific evidence and rules |
| Tooling | Expose broad tools and inspect later | Expose narrow tools with clear approval boundaries |
| Verification | Read the final answer | Check the artifact, trace, and final state |

The Problem

Agent work has the same requirement, but most teams still review natural-language transcripts. That works while the agent drafts prose. It fails when the agent triages incidents, writes SQL, edits Terraform, or proposes cloud changes. The transcript can be plausible while the final state is wrong.

The practical question is not whether an agent can produce a convincing response. The question is whether the engineering system around that response makes the work observable, reversible, and reviewable.

| Failure point | What breaks | Why it matters |
|---|---|---|
| Weak boundary | Agent authority is broader than the task | A diagnostic run can become an unsafe change |
| Missing evidence | The agent cannot cite the state it used | Review becomes opinion instead of verification |
| No lifecycle | The workflow ends at a message | Ownership, audit, cleanup, and rollback disappear |

Agent Eval Harness

For DB teams, the eval unit should be a production-shaped task: diagnose blocking, classify replication lag, identify the bad query plan, reject an unsafe migration, or prove a backup restore works.

flowchart TD
    A[task request — bounded intent] --> B[agent eval harness — controls]
    B --> C[tool execution — evidence collected]
    C --> D[verification — final state checked]
    D --> E[human handoff — audit retained]

  1. Define the operating boundary.
    Write down the task class, allowed tools, environment, data class, and approval mode before the agent runs.

  2. Shape the evidence.
    Return compact observations instead of raw dumps. The agent should see enough to reason, but not so much that context is wasted.

  3. Require proof of completion.
    Completion should be an artifact or state check: a passing test, a reviewed plan, a valid rollback, a trace, or a linked ticket.
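
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `OperatingBoundary`, `shape_evidence`, `proof_of_completion`, the tool names, and the `wait_seconds` field are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingBoundary:
    """Step 1: what the agent may do, written down before it runs."""
    task_class: str            # e.g. "diagnose_blocking" (hypothetical label)
    allowed_tools: frozenset   # tool names the agent may call
    environment: str           # e.g. "staging"
    data_class: str            # e.g. "no_pii"
    approval_mode: str         # e.g. "read_only"

    def permits(self, tool_name: str) -> bool:
        return tool_name in self.allowed_tools


def shape_evidence(rows: list, limit: int = 3) -> dict:
    """Step 2: return a compact observation instead of the raw dump."""
    ranked = sorted(rows, key=lambda r: r["wait_seconds"], reverse=True)
    return {
        "total_rows": len(rows),
        "top_waiters": ranked[:limit],
        "truncated": len(rows) > limit,
    }


def proof_of_completion(artifacts: dict, required: tuple) -> list:
    """Step 3: list the required artifacts that are missing or empty."""
    return [name for name in required if not artifacts.get(name)]


boundary = OperatingBoundary(
    task_class="diagnose_blocking",
    allowed_tools=frozenset({"list_sessions", "explain_query"}),
    environment="staging",
    data_class="no_pii",
    approval_mode="read_only",
)
sessions = [{"session_id": i, "wait_seconds": i * 10} for i in range(1, 6)]
observation = shape_evidence(sessions)
missing = proof_of_completion(
    {"trace": "trace-001", "linked_ticket": ""},
    required=("trace", "linked_ticket"),
)
print(boundary.permits("kill_session"), observation["total_rows"], missing)
```

The run is treated as incomplete while `missing` is non-empty, so a confident final message cannot stand in for the trace or the ticket.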


In Practice

Context: Anthropic describes agent evals as harnesses that run tasks, collect the model’s steps, grade the result, and aggregate performance. The important shift is from judging a single answer to measuring repeatable task outcomes. Source: Anthropic, "Demystifying evals for AI agents."

Action: Create a golden set of historical incidents and expected outcomes. Each eval should contain the starting evidence, the allowed tools, the expected final state, and the grader that decides pass or fail.
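
One possible shape for a single golden-set entry is sketched below. The four fields mirror the list above; the class name `EvalCase`, the grader `final_state_matches`, the tool names, and the incident data are assumptions made for illustration and do not come from any particular eval framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    task_class: str
    starting_evidence: dict               # what the agent is shown at the start
    allowed_tools: frozenset              # tools the agent may call in this eval
    expected_final_state: dict            # the state a correct run must leave behind
    grader: Callable[[dict, dict], bool]  # (expected, actual) -> pass or fail


def final_state_matches(expected: dict, actual: dict) -> bool:
    """Pass only if every expected key has the expected value in the final state."""
    return all(actual.get(key) == value for key, value in expected.items())


def grade(case: EvalCase, actual_final_state: dict, tools_called: list) -> bool:
    # A run that used a tool outside the allowed set fails regardless of outcome.
    if any(tool not in case.allowed_tools for tool in tools_called):
        return False
    return case.grader(case.expected_final_state, actual_final_state)


# Hypothetical incident: the correct outcome is to reject the migration.
unsafe_migration = EvalCase(
    case_id="hypothetical-incident-001",
    task_class="reject_unsafe_migration",
    starting_evidence={"migration_id": "m-042", "takes_exclusive_lock": True},
    allowed_tools=frozenset({"read_schema", "estimate_lock"}),
    expected_final_state={"decision": "reject", "migration_applied": False},
    grader=final_state_matches,
)

passed = grade(
    unsafe_migration,
    actual_final_state={"decision": "reject", "migration_applied": False},
    tools_called=["read_schema"],
)
print(passed)
```

The grader never reads the agent's transcript. It compares the final state with the expected one, which is what lets the same case be scored identically on every replay.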

Result: The same incident can be replayed across model versions, prompt changes, and tool changes. If the agent regresses, the harness shows which task class broke.
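
A rough sketch of how a harness could surface which task class broke: compare per-class pass rates between a baseline run and a candidate run. The result format (dicts with `task_class` and `passed`) and both function names are assumptions for this example.

```python
from collections import defaultdict


def pass_rate_by_task_class(results: list) -> dict:
    """Aggregate individual eval results into a pass rate per task class."""
    totals = defaultdict(lambda: [0, 0])  # task_class -> [passed, total]
    for result in results:
        totals[result["task_class"]][0] += int(result["passed"])
        totals[result["task_class"]][1] += 1
    return {tc: passed / total for tc, (passed, total) in totals.items()}


def regressed_task_classes(baseline: list, candidate: list, tolerance: float = 0.0) -> dict:
    """Return task classes whose pass rate dropped by more than `tolerance`."""
    before = pass_rate_by_task_class(baseline)
    after = pass_rate_by_task_class(candidate)
    return {
        tc: (before[tc], after[tc])
        for tc in sorted(before)
        if tc in after and after[tc] < before[tc] - tolerance
    }


# Hypothetical results for the same golden set under two model or prompt versions.
baseline_run = [
    {"task_class": "diagnose_blocking", "passed": True},
    {"task_class": "diagnose_blocking", "passed": True},
    {"task_class": "classify_replication_lag", "passed": True},
    {"task_class": "classify_replication_lag", "passed": False},
]
candidate_run = [
    {"task_class": "diagnose_blocking", "passed": True},
    {"task_class": "diagnose_blocking", "passed": False},
    {"task_class": "classify_replication_lag", "passed": True},
    {"task_class": "classify_replication_lag", "passed": False},
]

regressions = regressed_task_classes(baseline_run, candidate_run)
print(regressions)
```

With only a handful of cases per class, a single flipped result moves the pass rate a long way, so a non-zero `tolerance` or a larger golden set may be needed before a drop is treated as a real regression.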

Learning: The unit of evaluation is the production-shaped task, not the individual answer. This is a documented pattern, or a direct consequence of how the named systems behave, rather than an account of a specific production deployment.

Where It Breaks

| Failure mode | Trigger | Fix |
|---|---|---|
| Transcript grading | Reviewer asks whether the answer sounded right | Grade final state, not prose |
| Tiny eval set | Only three happy-path tasks are tested | Use incident-shaped cases across failure classes |
| Leaky tools | Eval has tools unavailable in production | Match eval permissions to real deployment modes |
| No negative cases | Agent never sees unsafe migrations or ambiguous alerts | Add reject and escalate cases |
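
Several of these failure modes can be caught by checking the eval set itself before any agent runs. The sketch below is one way to do that, assuming each case is a dict with hypothetical `case_id`, `task_class`, `allowed_tools`, and `expected_outcome` keys; the threshold of two cases per class is arbitrary.

```python
def lint_eval_set(cases: list, production_tools: set, min_cases_per_class: int = 2) -> list:
    """Return human-readable problems found in an eval set (empty list if none)."""
    problems = []
    per_class = {}
    for case in cases:
        per_class[case["task_class"]] = per_class.get(case["task_class"], 0) + 1
        # Leaky tools: the eval grants tools that production does not.
        leaked = set(case["allowed_tools"]) - production_tools
        if leaked:
            problems.append(
                f"{case['case_id']}: tools not available in production: {sorted(leaked)}"
            )
    # Tiny eval set: too few cases for a task class.
    for task_class, count in sorted(per_class.items()):
        if count < min_cases_per_class:
            problems.append(
                f"{task_class}: only {count} case(s), want at least {min_cases_per_class}"
            )
    # No negative cases: the agent is never expected to reject or escalate.
    outcomes = {case["expected_outcome"] for case in cases}
    for needed in ("reject", "escalate"):
        if needed not in outcomes:
            problems.append(f"no '{needed}' case in the eval set")
    return problems


# Hypothetical eval set with one leaky tool, one thin task class, and no reject case.
cases = [
    {"case_id": "c1", "task_class": "diagnose_blocking",
     "allowed_tools": ["list_sessions"], "expected_outcome": "diagnose"},
    {"case_id": "c2", "task_class": "diagnose_blocking",
     "allowed_tools": ["list_sessions", "kill_session"], "expected_outcome": "escalate"},
    {"case_id": "c3", "task_class": "review_migration",
     "allowed_tools": ["read_schema"], "expected_outcome": "approve"},
]
problems = lint_eval_set(cases, production_tools={"list_sessions", "read_schema"})
for problem in problems:
    print(problem)
```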

What to Do Next

  • Problem: Most teams still review natural-language transcripts, and a transcript can be plausible while the final state is wrong.
  • Solution: Make the eval unit a production-shaped task, such as diagnosing blocking, classifying replication lag, identifying the bad query plan, rejecting an unsafe migration, or proving a backup restore works.
  • Proof: Replay the same incident across model versions, prompt changes, and tool changes. When the agent regresses, the harness shows which task class broke.
  • Action: Take five resolved database incidents and turn each into an eval with input evidence, allowed tools, expected outcome, and a pass or fail grader.

The teams that get value from agents will not be the teams with the longest prompts. They will be the teams that turn agent work into a controlled engineering workflow.