Evaluate AI Agents by Completed Work, Not Token Price

Per-token pricing is the wrong abstraction for AI agents because agents do not sell tokens; they either finish work or create review debt. A large language model, or LLM, predicts and generates text, while an AI agent wraps that model with tools such as browsers, shells, document editors, and code runners. The default approach is token-price comparison; the better approach is task-level evaluation, where GPT-5.5, GPT-5.4, Claude Opus, or any other model is judged by completed work.

Situation

Agentic systems are moving from chat windows into real production workflows: Codex modifying repos, browser-use agents clicking through applications, Claude Desktop calling Model Context Protocol servers, and document agents producing Word, PowerPoint, and spreadsheet artifacts. The pressure is no longer “which model is cheapest per million tokens?” It is “which model finishes the task with the least total operational cost?”

A token is a chunk of text, not a word. As a rough guide, 1,000 English tokens correspond to about 750 words. Token budgets, context windows, subscription limits, and weekly usage caps are likewise different measurements and should not be casually mixed.

	Token-price comparison	Task-level agent evaluation
Unit of measure	Dollars per input/output token	Dollars per accepted task
Looks cheap when	Model emits fewer billed tokens	Model finishes with fewer retries
Limitation	Misses human review time, tool failures, bad assumptions	Harder to collect, but closer to reality
Best use	Simple API budgeting	Production agent selection

The Problem

The non-obvious failure is that agent cost compounds through retries. A cheaper model that misunderstands intent, reopens files repeatedly, burns browser screenshots, or needs human correction can be more expensive than a stronger model with higher token pricing.

Failure point	What breaks	Why it matters
Token-only model selection	GPT-5.4 looks cheaper than GPT-5.5 on the rate card	A second or third attempt can erase the savings
Browser verification	Agent clicks through UI but checks only superficial page state	False positives ship broken workflows
Computer-use workflows	Screenshots and visual reasoning repeat across turns	Cost and latency rise without obvious code changes
Long prompts	Large task briefs hide priorities	The agent may overbuild, add unnecessary guardrails, or miss the critical acceptance test
Tiny prompts	Context is restated across many turns	The user pays for repeated setup, clarification, and tool planning

The right metric is not cost per token. The right metric is cost per accepted completion.
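To make the metric concrete, here is a minimal sketch of the arithmetic. Every number in it is invented for illustration: the per-million-token prices, the reviewer hourly rate, and the two batches are assumptions, not measurements of any real model. `RunCost` and `cost_per_accepted` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RunCost:
    """Totals across all attempts for one model on one batch of tasks."""
    input_tokens: int
    output_tokens: int
    human_review_minutes: float
    accepted_tasks: int


def cost_per_accepted(run, usd_per_m_input, usd_per_m_output, reviewer_usd_per_hour):
    """Return (token cost + review cost) divided by the number of accepted tasks."""
    token_cost = (
        run.input_tokens * usd_per_m_input + run.output_tokens * usd_per_m_output
    ) / 1_000_000
    review_cost = run.human_review_minutes / 60 * reviewer_usd_per_hour
    if run.accepted_tasks == 0:
        return float("inf")  # nothing was accepted, so the spend bought no work
    return (token_cost + review_cost) / run.accepted_tasks


# Hypothetical batch of 10 tasks per model; retries inflate the cheap model's totals.
cheap = RunCost(input_tokens=6_000_000, output_tokens=600_000,
                human_review_minutes=300, accepted_tasks=6)
strong = RunCost(input_tokens=3_000_000, output_tokens=300_000,
                 human_review_minutes=90, accepted_tasks=9)

cheap_cost = cost_per_accepted(cheap, 1.0, 4.0, reviewer_usd_per_hour=90.0)
strong_cost = cost_per_accepted(strong, 5.0, 20.0, reviewer_usd_per_hour=90.0)
print(f"cheap:  ${cheap_cost:.2f} per accepted task")   # about 76.40
print(f"strong: ${strong_cost:.2f} per accepted task")  # about 17.33
```

In this made-up case the model with the 5x higher rate card is cheaper per accepted task, because review minutes dominate the bill. Change the inputs and the ranking can flip, which is the reason to measure rather than assume.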

The Implementation

Build a task-level evaluation loop around representative internal work. Public benchmarks are useful for press releases and procurement theater. Production selection needs your schemas, your repos, your review standards, your permissions model, and your failure tolerance.

flowchart TD
    Eng[Senior engineer] --> Pack[15-task eval pack]
    Pack --> MA[Model A — run with prompt contract]
    Pack --> MB[Model B — run with prompt contract]
    MA --> Repo[read files, patch, run tests]
    MB --> Repo
    Repo --> Browser[browser assertions and Playwright checks]
    Browser --> Log[(eval_results — tokens, retries, elapsed, accepted)]
    Log --> Policy[routing policy by task class]
    Policy --> Eng
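
The loop in the flowchart can be sketched as a small harness. This is an outline under assumptions, not a finished runner: `EvalTask`, `run_eval_pack`, and the `runner` callable are hypothetical, and the runner stands in for whatever actually drives the agent and checks its acceptance criteria.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalTask:
    task_id: str
    task_class: str                      # e.g. "cross_file_refactor"
    acceptance_criteria: tuple[str, ...]


# Assumed signature: (model, prompt_version, task) -> True if the result was accepted.
Runner = Callable[[str, str, EvalTask], bool]


def run_eval_pack(tasks, models, prompt_version, runner: Runner, max_attempts=3):
    """Run every task on every model and return one result row per (task, model)."""
    rows = []
    for task in tasks:
        if not task.acceptance_criteria:
            raise ValueError(f"{task.task_id} has no acceptance criteria")
        for model in models:
            accepted = False
            attempts = 0
            while attempts < max_attempts and not accepted:
                attempts += 1
                accepted = runner(model, prompt_version, task)
            rows.append({
                "task_id": task.task_id,
                "task_class": task.task_class,
                "model": model,
                "prompt_version": prompt_version,
                "retries": attempts - 1,
                "accepted": accepted,
            })
    return rows


# Demo with a fake runner: "model_b" only succeeds on its second attempt.
_calls = {}

def fake_runner(model, prompt_version, task):
    key = (model, task.task_id)
    _calls[key] = _calls.get(key, 0) + 1
    return model == "model_a" or _calls[key] >= 2


demo_rows = run_eval_pack(
    [EvalTask("t1", "failing_test_repair", ("test suite passes",))],
    ["model_a", "model_b"],
    prompt_version="v1",
    runner=fake_runner,
)
```

Rejecting tasks without acceptance criteria up front is a deliberate choice here: a task with no definition of done cannot produce a meaningful `accepted` value.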

Define a task pack from real work. Use 10 to 30 tasks that cover at least one frontend fix, one cross-file refactor, one failing test repair, one spreadsheet/report task, one browser-verified workflow, and one ambiguous production bug. Confirm: every task has expected output and acceptance criteria.
Write a prompt contract. Include goal, constraints, allowed tools, forbidden actions, verification steps, rollback expectations, and final reporting format. For long-running agents, fewer complete prompts usually beat many tiny prompts because the model carries intent through the run instead of rediscovering it every turn. Confirm: another engineer can run the task without asking what “done” means.
Log workflow metrics, not just tokens.

Metric	Why it belongs
`model`	Identifies what ran: GPT-5.5, GPT-5.4, Claude Opus, local model
`prompt_version`	Prevents comparing different instructions
`input_tokens`, `output_tokens`	Still needed, just not sufficient
`retries`	Exposes cheap models that need repeated attempts
`wall_clock_seconds`	Captures user wait time
`tool_errors`	Shows MCP, browser, shell, or permission friction
`human_review_minutes`	Often the largest hidden cost
`quality_score`	Turns subjective review into comparable data
`accepted`	The only number leadership really understands

Confirm: every run produces one row in `agent_eval_results`.
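
One way to get one row per run is an append-only CSV whose columns mirror the metric table. This is a sketch under assumptions: the post does not say where `agent_eval_results` lives, so the CSV file, the `EvalResult` class, and `append_result` are hypothetical, as is treating `quality_score` as an integer rubric score.

```python
import csv
from dataclasses import asdict, dataclass, fields
from pathlib import Path


@dataclass
class EvalResult:
    model: str
    prompt_version: str
    input_tokens: int
    output_tokens: int
    retries: int
    wall_clock_seconds: float
    tool_errors: int
    human_review_minutes: float
    quality_score: int      # assumed to be an integer rubric score
    accepted: bool


def append_result(path, result):
    """Append one run as one CSV row, writing the header if the file is new or empty."""
    path = Path(path)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=[fld.name for fld in fields(EvalResult)])
        if write_header:
            writer.writeheader()
        writer.writerow(asdict(result))
```

Calling `append_result("agent_eval_results.csv", EvalResult(...))` at the end of every run, including failed ones, keeps rejected attempts in the data instead of silently dropping them.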

Add browser assertions, not just browser activity. If the task builds a Trello-style notes app, the verification should create 20 cards, move each card twice, reload, and assert persistence. Watching the cursor move is entertainment. Assertions are engineering. Confirm: the run fails when expected UI state is missing.
Route by complexity. Use medium effort for routine CRUD edits, high effort for cross-file refactors, and extra-high only for long-horizon tasks involving planning, implementation, tests, and artifact generation. Confirm: routing policy is written down and reviewed monthly.
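
A written-down routing policy can be as small as a lookup table plus an escalation rule. The sketch below assumes three task-class names and an effort ladder of `medium`, `high`, `extra_high`. Those identifiers are illustrative, not any vendor's API parameters, and the one-step escalation per failed attempt is one possible rule.

```python
# Effort levels in ascending order; the names are placeholders for your provider's settings.
EFFORT_LADDER = ["medium", "high", "extra_high"]

# Hypothetical task classes mapped to the starting effort described in the text.
ROUTING_POLICY = {
    "crud_edit": "medium",
    "cross_file_refactor": "high",
    "long_horizon": "extra_high",
}


def pick_effort(task_class, failed_attempts=0):
    """Return the effort level for a task, stepping up one level per failed attempt."""
    if task_class not in ROUTING_POLICY:
        # Unknown classes fail loudly so the policy gets updated, not guessed at.
        raise ValueError(f"no routing rule for task class {task_class!r}")
    start = EFFORT_LADDER.index(ROUTING_POLICY[task_class])
    level = min(start + failed_attempts, len(EFFORT_LADDER) - 1)
    return EFFORT_LADDER[level]
```

For example, `pick_effort("crud_edit")` returns `"medium"`, and `pick_effort("crud_edit", failed_attempts=1)` returns `"high"`. Keeping the table in version control gives the monthly review something concrete to diff.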

In Practice

Context: Public benchmarks such as SWE-bench and vendor agent demos are useful for capability signal, but they do not measure your review time, approval friction, flaky browser runs, or repo-specific retries. I am not claiming a universal cost ranking between models. The claim is narrower: per-token price is incomplete once agents can use tools and repeat work.

Action: Build a 15-task eval pack from real internal work; it yields a routing policy that generic benchmarks cannot. Representative tasks: a flaky test repair, a cross-file refactor, a data export from a warehouse, and a browser-verified UI flow. Log retries, wall-clock seconds, tool errors, and human review minutes alongside tokens. Those four numbers tell a different story than the rate card.

Result: The expected output is not a universal winner. It is routing policy. A stronger model may be cheaper on ambiguous multi-file tasks if it succeeds in fewer passes. A cheaper or lower-effort model may be the right choice for bounded mechanical edits — formatting, scaffolding, narrow refactors — where the task is well-specified and the risk of wrong assumptions is low.
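
Turning logged runs into a routing policy can be sketched as a group-by. The code assumes each row carries a `task_class`, a `model`, an `accepted` flag, and a `cost_usd` figure that already combines token and review cost. `cost_usd`, `derive_routing`, and the sample rows are hypothetical and the numbers are invented.

```python
from collections import defaultdict


def derive_routing(rows):
    """Pick, per task class, the model with the lowest cost per accepted run."""
    totals = defaultdict(lambda: {"cost": 0.0, "accepted": 0})
    for row in rows:
        bucket = totals[(row["task_class"], row["model"])]
        bucket["cost"] += row["cost_usd"]       # failed runs still cost money
        bucket["accepted"] += 1 if row["accepted"] else 0

    best = {}
    for (task_class, model), t in totals.items():
        if t["accepted"] == 0:
            continue  # a model with no accepted runs cannot be routed to
        cost = t["cost"] / t["accepted"]
        if task_class not in best or cost < best[task_class][1]:
            best[task_class] = (model, cost)
    return {task_class: model for task_class, (model, _) in best.items()}


# Invented rows: model_a is the pricier model, model_b the cheaper one.
sample_rows = [
    {"task_class": "ambiguous_bug", "model": "model_a", "cost_usd": 4.0, "accepted": True},
    {"task_class": "ambiguous_bug", "model": "model_a", "cost_usd": 4.0, "accepted": True},
    {"task_class": "ambiguous_bug", "model": "model_b", "cost_usd": 1.5, "accepted": False},
    {"task_class": "ambiguous_bug", "model": "model_b", "cost_usd": 1.5, "accepted": False},
    {"task_class": "ambiguous_bug", "model": "model_b", "cost_usd": 1.5, "accepted": True},
    {"task_class": "formatting", "model": "model_a", "cost_usd": 2.0, "accepted": True},
    {"task_class": "formatting", "model": "model_b", "cost_usd": 0.5, "accepted": True},
]
policy = derive_routing(sample_rows)
print(policy)
```

With these invented rows the pricier model wins the ambiguous bug class and the cheaper one wins formatting. With only a handful of runs per class, treat the output as a starting policy to revisit, not a verdict.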

Learning: Browser and computer-use agents need strict permissions regardless of model. Repeated approval prompts, flaky CSS selectors, nondeterministic page timing, and screenshot-heavy loops are not just UX friction. They are cost multipliers that make any model more expensive than its token rate suggests.

Where It Breaks

Failure mode	Trigger	Fix
Strong model overbuilds	Ambiguous prompt says “make it production ready”	Specify scope, non-goals, and acceptance tests
Cheap model burns retries	Task requires multi-file reasoning across unfamiliar repo	Route to higher reasoning effort after first failed attempt
Browser verification lies	Agent checks page loaded, not state mutation	Use Playwright assertions and persisted test data
Tool permission drag	MCP server asks for approval every run	Preconfigure allowed tools per project and keep destructive actions gated
Screenshot token burn	Computer-use agent visually inspects every step	Prefer DOM selectors and screenshots only at checkpoints
Context window confusion	Team compares words, tokens, and weekly caps as equivalent	Track actual token usage per completed workflow
Public benchmark mismatch	Model scores well on coding evals but fails internal workflows	Build eval tasks from real repos, schemas, and review rubrics

What to Do Next

Problem: Token pricing hides retries, review time, elapsed time, and tool reliability.
Solution: Evaluate agents by accepted task completion using real internal workflows.
Proof: The winning model will vary by task class; routing beats picking one default for everything.
Action: This week, create a 10-task eval pack and log `model`, `prompt_version`, `input_tokens`, `output_tokens`, `retries`, `wall_clock_seconds`, `tool_errors`, `human_review_minutes`, and `accepted`.

Rajiv