Per-token pricing is the wrong abstraction for AI agents because agents do not sell tokens; they either finish work or create review debt. A large language model, or LLM, predicts and generates text, while an AI agent wraps that model with tools such as browsers, shells, document editors, and code runners. The default approach is token-price comparison; the better approach is task-level evaluation, where GPT-5.5, GPT-5.4, Claude Opus, or any other model is judged by completed work.

Situation

Agentic systems are moving from chat windows into real production workflows: Codex modifying repos, browser-use agents clicking through applications, Claude Desktop calling Model Context Protocol servers, and document agents producing Word, PowerPoint, and spreadsheet artifacts. The pressure is no longer “which model is cheapest per million tokens?” It is “which model finishes the task with the least total operational cost?”

A token is a chunk of text, not a word: roughly 1,000 English tokens come to about 750 words. Token budgets, context windows, subscription limits, and weekly usage caps are likewise different measurements and should not be casually mixed.

| | Token-price comparison | Task-level agent evaluation |
| --- | --- | --- |
| Unit of measure | Dollars per input/output token | Dollars per accepted task |
| Looks cheap when | Model emits fewer billed tokens | Model finishes with fewer retries |
| Misses | Human review time, tool failures, bad assumptions | Harder to collect, but closer to reality |
| Best use | Simple API budgeting | Production agent selection |

The Problem

The non-obvious failure is that agent cost compounds through retries. A cheaper model that misunderstands intent, reopens files repeatedly, burns browser screenshots, or needs human correction can be more expensive than a stronger model with higher token pricing.

| Failure point | What breaks | Why it matters |
| --- | --- | --- |
| Token-only model selection | GPT-5.4 looks cheaper than GPT-5.5 on the rate card | A second or third attempt can erase the savings |
| Browser verification | Agent clicks through UI but checks only superficial page state | False positives ship broken workflows |
| Computer-use workflows | Screenshots and visual reasoning repeat across turns | Cost and latency rise without obvious code changes |
| Long prompts | Large task briefs hide priorities | The agent may overbuild, add unnecessary guardrails, or miss the critical acceptance test |
| Tiny prompts | Context is restated across many turns | The user pays for repeated setup, clarification, and tool planning |

The right metric is not cost per token. The right metric is cost per accepted completion.
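As a rough illustration of that metric, the sketch below divides total spend (token cost plus reviewer time) by the number of accepted completions. The `RunTotals` fields, the rates, and every number in the example are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements of any real model.

```python
from dataclasses import dataclass


@dataclass
class RunTotals:
    """Totals for one model on one task pack. Token counts include every retry."""
    accepted: int                 # runs a reviewer accepted
    input_tokens: int
    output_tokens: int
    human_review_minutes: float


def cost_per_accepted(
    totals: RunTotals,
    usd_per_m_input: float,
    usd_per_m_output: float,
    reviewer_usd_per_hour: float,
) -> float:
    """Token spend plus review spend, divided by accepted completions."""
    if totals.accepted == 0:
        return float("inf")  # nothing was accepted, so no finite cost exists
    token_cost = (
        totals.input_tokens / 1_000_000 * usd_per_m_input
        + totals.output_tokens / 1_000_000 * usd_per_m_output
    )
    review_cost = totals.human_review_minutes / 60 * reviewer_usd_per_hour
    return (token_cost + review_cost) / totals.accepted


# Made-up inputs: the cheaper rate card burns more tokens on retries
# and needs far more reviewer time to reach the same 10 accepted tasks.
cheap = RunTotals(accepted=10, input_tokens=6_000_000,
                  output_tokens=1_000_000, human_review_minutes=300)
strong = RunTotals(accepted=10, input_tokens=3_000_000,
                   output_tokens=500_000, human_review_minutes=60)

cheap_cost = cost_per_accepted(cheap, 1.0, 4.0, 100.0)
strong_cost = cost_per_accepted(strong, 3.0, 12.0, 100.0)
print(f"cheap rate card:  ${cheap_cost:.2f} per accepted task")   # $51.00
print(f"strong rate card: ${strong_cost:.2f} per accepted task")  # $11.50
```

In this invented scenario the token bill is the smaller term for both models; reviewer time dominates, which is why a rate-card comparison alone can point the wrong way.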

The Implementation

Build a task-level evaluation loop around representative internal work. Public benchmarks are useful for press releases and procurement theater. Production selection needs your schemas, your repos, your review standards, your permissions model, and your failure tolerance.

flowchart TD
    Eng[Senior engineer] --> Pack[15-task eval pack]
    Pack --> MA[Model A — run with prompt contract]
    Pack --> MB[Model B — run with prompt contract]
    MA --> Repo[read files, patch, run tests]
    MB --> Repo
    Repo --> Browser[browser assertions and Playwright checks]
    Browser --> Log[(eval_results — tokens, retries, elapsed, accepted)]
    Log --> Policy[routing policy by task class]
    Policy --> Eng

  1. Define a task pack from real work. Use 10 to 30 tasks: one frontend fix, one cross-file refactor, one failing test repair, one spreadsheet/report task, one browser-verified workflow, and one ambiguous production bug. Confirm: every task has expected output and acceptance criteria.

  2. Write a prompt contract. Include goal, constraints, allowed tools, forbidden actions, verification steps, rollback expectations, and final reporting format. For long-running agents, fewer complete prompts usually beat many tiny prompts because the model carries intent through the run instead of rediscovering it every turn. Confirm: another engineer can run the task without asking what “done” means.

  3. Log workflow metrics, not just tokens.

| Metric | Why it belongs |
| --- | --- |
| model | GPT-5.5, GPT-5.4, Claude Opus, local model |
| prompt_version | Prevents comparing different instructions |
| input_tokens, output_tokens | Still needed, just not sufficient |
| retries | Exposes cheap models that need repeated attempts |
| wall_clock_seconds | Captures user wait time |
| tool_errors | Shows MCP, browser, shell, or permission friction |
| human_review_minutes | Often the largest hidden cost |
| quality_score | Turns subjective review into comparable data |
| accepted | The only number leadership really understands |

Confirm: every run produces one row in agent_eval_results.
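One possible shape for that row is sketched below, using only the Python standard library. The column names follow the metric table; `task_id`, the 1-5 scale for `quality_score`, and the CSV format are assumptions added for illustration, and a database table would work equally well.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class EvalRow:
    """One agent run. task_id is an added assumption, not a column from the table."""
    task_id: str
    model: str
    prompt_version: str
    input_tokens: int
    output_tokens: int
    retries: int
    wall_clock_seconds: float
    tool_errors: int
    human_review_minutes: float
    quality_score: int  # assumption: a 1-5 reviewer rubric
    accepted: bool


COLUMNS = [f.name for f in fields(EvalRow)]


def append_row(handle, row: EvalRow, write_header: bool = False) -> None:
    """Write exactly one CSV line for one run, with an optional header first."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    if write_header:
        writer.writeheader()
    writer.writerow(asdict(row))


# io.StringIO stands in for a file such as agent_eval_results.csv.
buffer = io.StringIO()
append_row(
    buffer,
    EvalRow("flaky-test-01", "model-a", "v3", 41_200, 6_300, 1, 412.5, 2, 9.0, 4, True),
    write_header=True,
)
print(buffer.getvalue())
```

Keeping the row a flat dataclass means a missing metric fails at construction time instead of surfacing later as a blank cell.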

  4. Add browser assertions, not just browser activity. If the task builds a Trello-style notes app, the verification should create 20 cards, move each card twice, reload, and assert persistence. Watching the cursor move is entertainment. Assertions are engineering. Confirm: the run fails when expected UI state is missing.

  5. Route by complexity. Use medium effort for routine CRUD edits, high effort for cross-file refactors, and extra-high only for long-horizon tasks involving planning, implementation, tests, and artifact generation. Confirm: routing policy is written down and reviewed monthly.
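A written-down routing policy can be as small as a lookup table. The sketch below maps task classes to the three effort levels named in the last step and bumps the level by one for each failed attempt. The task-class keys, the `pick_effort` helper, and the choice to default unknown classes to high effort are all assumptions made for illustration.

```python
from enum import Enum


class Effort(Enum):
    MEDIUM = "medium"
    HIGH = "high"
    EXTRA_HIGH = "extra-high"


# Hypothetical task-class names; replace with the classes in your own eval pack.
ROUTING_POLICY = {
    "crud_edit": Effort.MEDIUM,
    "cross_file_refactor": Effort.HIGH,
    "long_horizon": Effort.EXTRA_HIGH,
}

# Ordered from lowest to highest so escalation is a simple index shift.
_LADDER = [Effort.MEDIUM, Effort.HIGH, Effort.EXTRA_HIGH]


def pick_effort(task_class: str, failed_attempts: int = 0) -> Effort:
    """Look up the base effort, then step up once per failed attempt."""
    # Assumption: an unclassified task is treated as high effort.
    base = ROUTING_POLICY.get(task_class, Effort.HIGH)
    index = min(_LADDER.index(base) + failed_attempts, len(_LADDER) - 1)
    return _LADDER[index]


print(pick_effort("crud_edit"))                     # Effort.MEDIUM
print(pick_effort("crud_edit", failed_attempts=1))  # Effort.HIGH
print(pick_effort("long_horizon", failed_attempts=3))  # Effort.EXTRA_HIGH
```

Because the policy is plain data, the monthly review can be a diff of `ROUTING_POLICY` against the latest eval results.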

In Practice

Context: Public benchmarks such as SWE-bench and vendor agent demos are useful for capability signal, but they do not measure your review time, approval friction, flaky browser runs, or repo-specific retries. I am not claiming a universal cost ranking between models. The claim is narrower: per-token price is incomplete once agents can use tools and repeat work.

Action: A 15-task eval pack that reflects real internal work produces routing policy that generic benchmarks cannot. Representative tasks: a flaky test repair, a cross-file refactor, a data export from a warehouse, and a browser-verified UI flow. Log retries, wall-clock seconds, tool errors, and human review minutes alongside tokens — those four numbers tell a different story than the rate card.
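One way to keep such a pack honest is to store each task with its prompt contract and reject any task where "done" is undefined. The sketch below is a minimal version of that check. The class names, field names, and example tasks are hypothetical; the fields themselves follow the contract items listed in the implementation steps.

```python
from dataclasses import dataclass


@dataclass
class PromptContract:
    goal: str
    constraints: list[str]
    allowed_tools: list[str]
    forbidden_actions: list[str]
    verification_steps: list[str]
    rollback_expectations: str
    reporting_format: str


@dataclass
class EvalTask:
    task_id: str
    task_class: str
    expected_output: str
    acceptance_criteria: list[str]
    contract: PromptContract


def gaps(task: EvalTask) -> list[str]:
    """Return the names of fields that would leave 'done' undefined."""
    missing = []
    if not task.expected_output.strip():
        missing.append("expected_output")
    if not task.acceptance_criteria:
        missing.append("acceptance_criteria")
    if not task.contract.goal.strip():
        missing.append("contract.goal")
    if not task.contract.verification_steps:
        missing.append("contract.verification_steps")
    return missing


# Invented example tasks: one fully specified, one too vague to evaluate.
ready = EvalTask(
    "flaky-test-01", "failing_test_repair",
    "Green test suite on the fix branch",
    ["20 consecutive passing runs"],
    PromptContract(
        goal="Make the flaky checkout test pass deterministically",
        constraints=["Do not change public API signatures"],
        allowed_tools=["shell", "file_edit"],
        forbidden_actions=["force-push", "delete migrations"],
        verification_steps=["Run the test suite 20 times with zero failures"],
        rollback_expectations="Revert the branch if verification fails",
        reporting_format="Summary, files changed, test output",
    ),
)
vague = EvalTask("bug-07", "ambiguous_production_bug", "", [],
                 PromptContract("Fix the bug", [], [], [], [], "", ""))

print(gaps(ready))  # []
print(gaps(vague))  # ['expected_output', 'acceptance_criteria', 'contract.verification_steps']
```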

Result: The expected output is not a universal winner. It is routing policy. A stronger model may be cheaper on ambiguous multi-file tasks if it succeeds in fewer passes. A cheaper or lower-effort model may be the right choice for bounded mechanical edits — formatting, scaffolding, narrow refactors — where the task is well-specified and the risk of wrong assumptions is low.
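To make "the output is routing policy" concrete, the sketch below groups logged runs by task class and model, computes cost per accepted run, and picks the cheapest model per class. The row keys, the model names, and all numbers are invented for illustration; `cost_usd` is assumed to be an all-in figure (tokens plus review time) computed elsewhere.

```python
from collections import defaultdict


def derive_routing(rows: list[dict]) -> dict[str, str]:
    """Pick, per task class, the model with the lowest cost per accepted run."""
    spend: dict[tuple[str, str], float] = defaultdict(float)
    accepted: dict[tuple[str, str], int] = defaultdict(int)
    for row in rows:
        key = (row["task_class"], row["model"])
        spend[key] += row["cost_usd"]          # failed attempts still cost money
        accepted[key] += 1 if row["accepted"] else 0

    best: dict[str, tuple[str, float]] = {}
    for (task_class, model), total in spend.items():
        if accepted[(task_class, model)] == 0:
            continue  # a model that never finished the class cannot win it
        cost = total / accepted[(task_class, model)]
        if task_class not in best or cost < best[task_class][1]:
            best[task_class] = (model, cost)
    return {task_class: model for task_class, (model, _) in best.items()}


# Made-up log: model-a is cheaper per run but needs three tries on the hard task.
rows = [
    {"task_class": "ambiguous_bug", "model": "model-a", "cost_usd": 2.0, "accepted": False},
    {"task_class": "ambiguous_bug", "model": "model-a", "cost_usd": 2.0, "accepted": False},
    {"task_class": "ambiguous_bug", "model": "model-a", "cost_usd": 2.0, "accepted": True},
    {"task_class": "ambiguous_bug", "model": "model-b", "cost_usd": 4.0, "accepted": True},
    {"task_class": "formatting", "model": "model-a", "cost_usd": 0.5, "accepted": True},
    {"task_class": "formatting", "model": "model-b", "cost_usd": 1.5, "accepted": True},
]

policy = derive_routing(rows)
print(policy)  # {'ambiguous_bug': 'model-b', 'formatting': 'model-a'}
```

With a real pack, each class should have enough runs that one lucky or unlucky attempt does not decide the route.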

Learning: Browser and computer-use agents need strict permissions regardless of model. Repeated approval prompts, flaky CSS selectors, nondeterministic page timing, and screenshot-heavy loops are not merely UX friction. They are cost multipliers that make any model more expensive than its token rate suggests.

Where It Breaks

| Failure mode | Trigger | Fix |
| --- | --- | --- |
| Strong model overbuilds | Ambiguous prompt says “make it production ready” | Specify scope, non-goals, and acceptance tests |
| Cheap model burns retries | Task requires multi-file reasoning across unfamiliar repo | Route to higher reasoning effort after first failed attempt |
| Browser verification lies | Agent checks page loaded, not state mutation | Use Playwright assertions and persisted test data |
| Tool permission drag | MCP server asks for approval every run | Preconfigure allowed tools per project and keep destructive actions gated |
| Screenshot token burn | Computer-use agent visually inspects every step | Prefer DOM selectors and screenshots only at checkpoints |
| Context window confusion | Team compares words, tokens, and weekly caps as equivalent | Track actual token usage per completed workflow |
| Public benchmark mismatch | Model scores well on coding evals but fails internal workflows | Build eval tasks from real repos, schemas, and review rubrics |
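The "browser verification lies" row comes down to asserting state after a reload instead of asserting that a page rendered. The sketch below shows that logic with the 20-card check described earlier, written against a hypothetical `BoardDriver` interface. In a real run the driver would wrap Playwright or a similar tool; here an in-memory stand-in (`InMemoryBoard`, also hypothetical) plays the app so the check itself can be exercised without a browser.

```python
from typing import Protocol


class BoardDriver(Protocol):
    """Hypothetical interface; a real one would wrap a browser automation tool."""

    def create_card(self, title: str, column: str) -> None: ...
    def move_card(self, title: str, column: str) -> None: ...
    def reload(self) -> None: ...
    def cards(self) -> dict[str, str]: ...  # title -> column as currently shown


def verify_persistence(board: BoardDriver, count: int = 20) -> None:
    """Create cards, move each twice, reload, then require the moved state."""
    expected: dict[str, str] = {}
    for i in range(count):
        title = f"card-{i}"
        board.create_card(title, "todo")
        board.move_card(title, "doing")
        board.move_card(title, "done")
        expected[title] = "done"
    board.reload()
    actual = board.cards()
    if actual != expected:
        lost = sorted(set(expected) - set(actual))
        raise AssertionError(
            f"state after reload differs from expected; {len(lost)} of {count} cards missing"
        )


class InMemoryBoard:
    """Stand-in app: persists=False models a UI that updates but never saves."""

    def __init__(self, persists: bool) -> None:
        self._persists = persists
        self._saved: dict[str, str] = {}
        self._view: dict[str, str] = {}

    def create_card(self, title: str, column: str) -> None:
        self._view[title] = column
        if self._persists:
            self._saved[title] = column

    def move_card(self, title: str, column: str) -> None:
        self.create_card(title, column)

    def reload(self) -> None:
        self._view = dict(self._saved)  # a reload shows only what was stored

    def cards(self) -> dict[str, str]:
        return dict(self._view)


verify_persistence(InMemoryBoard(persists=True))  # passes silently
try:
    verify_persistence(InMemoryBoard(persists=False))
except AssertionError as err:
    print(f"caught: {err}")
```

The non-persisting board would pass a "page loaded" check at every step, and only the post-reload comparison exposes it.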

What to Do Next

  • Problem: Token pricing hides retries, review time, elapsed time, and tool reliability.
  • Solution: Evaluate agents by accepted task completion using real internal workflows.
  • Proof: The winning model will vary by task class; routing beats picking one default for everything.
  • Action: This week, create a 10-task eval pack and log model, prompt_version, input_tokens, output_tokens, retries, wall_clock_seconds, tool_errors, human_review_minutes, and accepted.