Evaluate AI Agents by Completed Work, Not Token Price
Content reflects the state as of March 2025. AI tooling and model capabilities in this area change frequently.
Per-token pricing is the wrong abstraction for AI agents because agents do not sell tokens; they either finish work or create review debt. A large language model, or LLM, predicts and generates text, while an AI agent wraps that model with tools such as browsers, shells, document editors, and code runners. The default approach is token-price comparison; the better approach is task-level evaluation, where GPT-5.5, GPT-5.4, Claude Opus, or any other model is judged by completed work.
Situation
Agentic systems are moving from chat windows into real production workflows: Codex modifying repos, browser-use agents clicking through applications, Claude Desktop calling Model Context Protocol servers, and document agents producing Word, PowerPoint, and spreadsheet artifacts. The pressure is no longer “which model is cheapest per million tokens?” It is “which model finishes the task with the least total operational cost?”
A token is a chunk of text, not a word. Roughly, 1,000 English tokens correspond to about 750 words. Token budgets, context windows, subscription limits, and weekly usage caps are likewise different measurements, and they should not be casually mixed.
| | Token-price comparison | Task-level agent evaluation |
|---|---|---|
| Unit of measure | Dollars per input/output token | Dollars per accepted task |
| Looks cheap when | Model emits fewer billed tokens | Model finishes with fewer retries |
| Trade-off | Misses human review time, tool failures, and bad assumptions | Harder to collect, but closer to reality |
| Best use | Simple API budgeting | Production agent selection |
The Problem
The non-obvious failure is that agent cost compounds through retries. A cheaper model that misunderstands intent, reopens files repeatedly, burns browser screenshots, or needs human correction can be more expensive than a stronger model with higher token pricing.
| Failure point | What breaks | Why it matters |
|---|---|---|
| Token-only model selection | GPT-5.4 looks cheaper than GPT-5.5 on the rate card | A second or third attempt can erase the savings |
| Browser verification | Agent clicks through UI but checks only superficial page state | False positives ship broken workflows |
| Computer-use workflows | Screenshots and visual reasoning repeat across turns | Cost and latency rise without obvious code changes |
| Long prompts | Large task briefs hide priorities | The agent may overbuild, add unnecessary guardrails, or miss the critical acceptance test |
| Tiny prompts | Context is restated across many turns | The user pays for repeated setup, clarification, and tool planning |
The right metric is not cost per token. The right metric is cost per accepted completion.
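To make the metric concrete, here is a minimal Python sketch of cost per accepted completion. Everything in it is an illustrative assumption: the `Attempt` fields, the per-million-token prices, and the reviewer hourly rate are made up for the example and are not vendor pricing or measured results.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One agent attempt at a task. All fields are illustrative."""
    input_tokens: int
    output_tokens: int
    human_review_minutes: float
    accepted: bool


def cost_per_accepted_task(
    attempts: list[Attempt],
    usd_per_m_input: float,
    usd_per_m_output: float,
    reviewer_usd_per_hour: float,
) -> float:
    """Total spend (tokens plus review time) divided by accepted completions."""
    token_cost = sum(
        a.input_tokens / 1e6 * usd_per_m_input
        + a.output_tokens / 1e6 * usd_per_m_output
        for a in attempts
    )
    review_cost = (
        sum(a.human_review_minutes for a in attempts) / 60 * reviewer_usd_per_hour
    )
    accepted = sum(a.accepted for a in attempts)
    if accepted == 0:
        return float("inf")  # nothing was accepted, so there is no completion to price
    return (token_cost + review_cost) / accepted


# Hypothetical comparison: a cheaper model that needs three passes
# versus a pricier model that is accepted on the first pass.
cheap = cost_per_accepted_task(
    [Attempt(200_000, 20_000, 15, False),
     Attempt(200_000, 20_000, 15, False),
     Attempt(200_000, 20_000, 15, True)],
    usd_per_m_input=1.0, usd_per_m_output=4.0, reviewer_usd_per_hour=90.0,
)
strong = cost_per_accepted_task(
    [Attempt(200_000, 20_000, 10, True)],
    usd_per_m_input=5.0, usd_per_m_output=20.0, reviewer_usd_per_hour=90.0,
)
```

With these made-up inputs the pricier model comes out cheaper per accepted task, because review time across the failed attempts dominates the token bill. Different inputs can flip the result, which is the reason to measure rather than assume.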
The Implementation
Build a task-level evaluation loop around representative internal work. Public benchmarks are useful for press releases and procurement theater. Production selection needs your schemas, your repos, your review standards, your permissions model, and your failure tolerance.
flowchart TD
Eng[Senior engineer] --> Pack[15-task eval pack]
Pack --> MA[Model A — run with prompt contract]
Pack --> MB[Model B — run with prompt contract]
MA --> Repo[read files, patch, run tests]
MB --> Repo
Repo --> Browser[browser assertions and Playwright checks]
Browser --> Log[(eval_results — tokens, retries, elapsed, accepted)]
Log --> Policy[routing policy by task class]
Policy --> Eng
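The loop in the diagram can be sketched in a few lines of Python. This is a sketch under stated assumptions, not a reference harness: `runner` is a hypothetical callable that wraps whatever agent you use, and the `passed` flag stands for the automated checks only. Human review minutes and a quality score would be added to each row afterwards.

```python
import time
from typing import Callable

# Hypothetical runner signature: takes (model, task_prompt) and returns a dict
# with "input_tokens", "output_tokens", "tool_errors", and "passed".
Runner = Callable[[str, str], dict]


def run_eval(
    tasks: dict[str, str],
    models: list[str],
    runner: Runner,
    prompt_version: str,
    max_attempts: int = 3,
) -> list[dict]:
    """Run every task against every model and return one result row per run."""
    rows = []
    for model in models:
        for task_id, prompt in tasks.items():
            started = time.monotonic()
            totals = {"input_tokens": 0, "output_tokens": 0, "tool_errors": 0}
            accepted = False
            attempts = 0
            # Retry until the automated checks pass or the attempt budget runs out.
            while attempts < max_attempts and not accepted:
                attempts += 1
                result = runner(model, prompt)
                for key in totals:
                    totals[key] += result[key]
                accepted = result["passed"]
            rows.append({
                "model": model,
                "task_id": task_id,
                "prompt_version": prompt_version,
                **totals,
                "retries": attempts - 1,
                "wall_clock_seconds": time.monotonic() - started,
                "accepted": accepted,
            })
    return rows
```

Token and tool-error counts accumulate across attempts on purpose, so a model that needs three passes shows its full bill in a single row.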
- Define a task pack from real work. Use 10 to 30 tasks that cover at least one frontend fix, one cross-file refactor, one failing test repair, one spreadsheet/report task, one browser-verified workflow, and one ambiguous production bug. Confirm: every task has expected output and acceptance criteria.
- Write a prompt contract. Include goal, constraints, allowed tools, forbidden actions, verification steps, rollback expectations, and final reporting format. For long-running agents, fewer complete prompts usually beat many tiny prompts because the model carries intent through the run instead of rediscovering it every turn. Confirm: another engineer can run the task without asking what “done” means.
- Log workflow metrics, not just tokens.
| Metric | Why it belongs |
|---|---|
| `model` | Which model ran, e.g. GPT-5.5, GPT-5.4, Claude Opus, or a local model |
| `prompt_version` | Prevents comparing different instructions |
| `input_tokens`, `output_tokens` | Still needed, just not sufficient |
| `retries` | Exposes cheap models that need repeated attempts |
| `wall_clock_seconds` | Captures user wait time |
| `tool_errors` | Shows MCP, browser, shell, or permission friction |
| `human_review_minutes` | Often the largest hidden cost |
| `quality_score` | Turns subjective review into comparable data |
| `accepted` | The only number leadership really understands |
Confirm: every run produces one row in agent_eval_results.
- Add browser assertions, not just browser activity. If the task builds a Trello-style notes app, the verification should create 20 cards, move each card twice, reload, and assert persistence. Watching the cursor move is entertainment. Assertions are engineering. Confirm: the run fails when expected UI state is missing.
- Route by complexity. Use medium effort for routine CRUD edits, high effort for cross-file refactors, and extra-high only for long-horizon tasks involving planning, implementation, tests, and artifact generation. Confirm: routing policy is written down and reviewed monthly.
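The routing step can be written down as code so it is reviewable like anything else. The sketch below is one possible shape, not a prescribed policy. The task-class keys and the default for unknown classes are assumptions. The one-level escalation per failed attempt is also an addition, borrowed from the fix listed under Where It Breaks for cheap models that burn retries.

```python
EFFORT_LEVELS = ["medium", "high", "extra-high"]

# Hypothetical task classes mapped to the starting effort described in the step above.
BASE_EFFORT = {
    "crud_edit": "medium",
    "cross_file_refactor": "high",
    "long_horizon": "extra-high",
}


def choose_effort(task_class: str, failed_attempts: int = 0) -> str:
    """Return the base effort for a task class, raised one level per failed attempt."""
    # Assumption: task classes that are not listed start at "high".
    base = BASE_EFFORT.get(task_class, "high")
    index = min(EFFORT_LEVELS.index(base) + failed_attempts, len(EFFORT_LEVELS) - 1)
    return EFFORT_LEVELS[index]
```

Keeping the table in a plain dictionary means the monthly review is a diff on a few lines rather than a meeting about intuitions.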
In Practice
Context: Public benchmarks such as SWE-bench and vendor agent demos are useful for capability signal, but they do not measure your review time, approval friction, flaky browser runs, or repo-specific retries. I am not claiming a universal cost ranking between models. The claim is narrower: per-token price is incomplete once agents can use tools and repeat work.
Action: A 15-task eval pack that reflects real internal work produces routing policy that generic benchmarks cannot. Representative tasks: a flaky test repair, a cross-file refactor, a data export from a warehouse, and a browser-verified UI flow. Log retries, wall-clock seconds, tool errors, and human review minutes alongside tokens — those four numbers tell a different story than the rate card.
Result: The expected output is not a universal winner. It is routing policy. A stronger model may be cheaper on ambiguous multi-file tasks if it succeeds in fewer passes. A cheaper or lower-effort model may be the right choice for bounded mechanical edits — formatting, scaffolding, narrow refactors — where the task is well-specified and the risk of wrong assumptions is low.
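Turning logged runs into routing policy is mostly a group-by. The sketch below assumes each row carries two fields that are not in the metrics table: a hypothetical `task_class` label and a `total_cost_usd` value that already combines token spend and review time. It picks, per task class, the model with the lowest cost per accepted task.

```python
from collections import defaultdict


def routing_policy(rows: list[dict]) -> dict[str, str]:
    """Pick, per task class, the model with the lowest cost per accepted task."""
    spend = defaultdict(float)   # (task_class, model) -> total USD across all runs
    accepted = defaultdict(int)  # (task_class, model) -> number of accepted runs
    for row in rows:
        key = (row["task_class"], row["model"])
        spend[key] += row["total_cost_usd"]
        accepted[key] += int(row["accepted"])

    best: dict[str, tuple[float, str]] = {}
    for (task_class, model), total in spend.items():
        if accepted[(task_class, model)] == 0:
            continue  # a model that never finished cannot win the class
        cost = total / accepted[(task_class, model)]
        if task_class not in best or cost < best[task_class][0]:
            best[task_class] = (cost, model)
    return {task_class: model for task_class, (cost, model) in best.items()}
```

Failed runs still count toward spend, so a model that is cheap per run but rarely accepted loses the class it keeps failing. With only 10 to 30 tasks the per-class samples are small, so treat the output as a starting policy to revisit, not a ranking.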
Learning: Browser and computer-use agents need strict permissions regardless of model. Repeated approval prompts, flaky CSS selectors, nondeterministic page timing, and screenshot-heavy loops are not merely UX friction. They are cost multipliers that make any model more expensive than its token rate suggests.
Where It Breaks
| Failure mode | Trigger | Fix |
|---|---|---|
| Strong model overbuilds | Ambiguous prompt says “make it production ready” | Specify scope, non-goals, and acceptance tests |
| Cheap model burns retries | Task requires multi-file reasoning across unfamiliar repo | Route to higher reasoning effort after first failed attempt |
| Browser verification lies | Agent checks page loaded, not state mutation | Use Playwright assertions and persisted test data |
| Tool permission drag | MCP server asks for approval every run | Preconfigure allowed tools per project and keep destructive actions gated |
| Screenshot token burn | Computer-use agent visually inspects every step | Prefer DOM selectors and screenshots only at checkpoints |
| Context window confusion | Team compares words, tokens, and weekly caps as equivalent | Track actual token usage per completed workflow |
| Public benchmark mismatch | Model scores well on coding evals but fails internal workflows | Build eval tasks from real repos, schemas, and review rubrics |
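The "browser verification lies" row is the easiest one to pin down in code. The sketch below expresses the Trello-style check from the implementation steps (create 20 cards, move each twice, reload, assert persistence) against a hypothetical `BoardDriver` interface. In a real run that interface would be backed by a browser automation tool such as Playwright. The method names and the `todo`, `doing`, and `done` column names are assumptions made for illustration.

```python
from typing import Protocol


class BoardDriver(Protocol):
    """Hypothetical wrapper around a browser session driving the app under test."""

    def create_card(self, title: str, column: str) -> None: ...
    def move_card(self, title: str, column: str) -> None: ...
    def reload(self) -> None: ...
    def cards_in(self, column: str) -> set[str]: ...


def verify_persistence(board: BoardDriver, card_count: int = 20) -> None:
    """Create cards, move each twice, reload, and assert the final state survived."""
    titles = {f"card-{i}" for i in range(card_count)}
    for title in titles:
        board.create_card(title, "todo")
    for title in titles:
        board.move_card(title, "doing")
        board.move_card(title, "done")
    board.reload()
    # Assert on mutated state, not on the fact that the page loaded.
    assert board.cards_in("done") == titles, "cards missing from 'done' after reload"
    assert not board.cards_in("todo"), "stale cards left in 'todo'"
    assert not board.cards_in("doing"), "stale cards left in 'doing'"
```

Because the check raises `AssertionError` when state is missing, an app that renders correctly but drops data on reload fails the run instead of passing on a loaded page.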
What to Do Next
- Problem: Token pricing hides retries, review time, elapsed time, and tool reliability.
- Solution: Evaluate agents by accepted task completion using real internal workflows.
- Proof: The winning model will vary by task class; routing beats picking one default for everything.
- Action: This week, create a 10-task eval pack and log `model`, `prompt_version`, `input_tokens`, `output_tokens`, `retries`, `wall_clock_seconds`, `tool_errors`, `human_review_minutes`, and `accepted`.
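A CSV file is enough to start logging. The sketch below appends one row per run, with column names taken from the metrics table in the implementation section. The `agent_eval_results.csv` file name and the `log_run` helper are assumptions for illustration. A database table would work equally well.

```python
import csv
from pathlib import Path

# Column names follow the metrics table; the default file name is an assumption.
COLUMNS = [
    "model", "prompt_version", "input_tokens", "output_tokens", "retries",
    "wall_clock_seconds", "tool_errors", "human_review_minutes",
    "quality_score", "accepted",
]


def log_run(row: dict, path: Path = Path("agent_eval_results.csv")) -> None:
    """Append one run to the CSV, writing the header if the file is new."""
    missing = set(COLUMNS) - set(row)
    if missing:
        # Refuse partial rows so every run stays comparable.
        raise ValueError(f"missing columns: {sorted(missing)}")
    is_new = not path.exists()
    with path.open("a", newline="") as handle:
        # Extra keys in the row are ignored rather than silently widening the schema.
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, extrasaction="ignore")
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Rejecting rows with missing columns is a deliberate choice: a run logged without review minutes or an acceptance flag looks cheaper than it was.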