AI Token Cost Is the New Cloud Bill

LLM token spend is the first major infrastructure cost in a decade that scales with usage and design rather than with servers. Most teams are still reading it like a cloud bill from 2018 — by total dollars, after the fact — and that is exactly why it surprises them.

Problem

AI features shipped fast across most engineering orgs, and the bill arrived later. Unlike compute or storage, token cost does not track headcount or provisioned capacity. It tracks how many calls you make, how large each prompt is, which model you route to, and how much context you stuff into every request. A single verbose system prompt, an oversized model used for a trivial classification, or a retrieval pipeline re-embedding the same documents can multiply spend without changing what the user sees.

The result is a cost line nobody forecast and few can explain. The basic question — what does one user interaction actually cost us, and why? — usually has no answer.

Why it matters financially

Token cost compounds in ways that escape dashboards:

It scales with adoption, not provisioning. Success makes it worse. A feature that costs $0.02 per interaction is fine at 10k interactions/month and a budget problem at 10M.
The drivers are multiplicative. Model tier × prompt size × call volume × retries. A 2x prompt on a 3x-priced model at 1.5x retry rate is 9x the cost for the same outcome.
Waste is invisible at the unit level. A few thousand wasted tokens per call is rounding error in one request and a five-figure monthly line at scale.

When you can express cost per request, per user, and per feature, finance and engineering finally share one number — and you can forecast instead of react.

Technical root causes

Model over-selection. Frontier models used for extraction, classification, or formatting that a smaller, cheaper model handles at equivalent quality.
Prompt and context bloat. System prompts that grew by accretion; retrieved context pasted in wholesale rather than ranked and trimmed.
Missing caching. No prompt caching for stable instructions; no result caching for repeated queries.
Redundant retrieval and embedding. Re-embedding unchanged documents; retrieving more chunks than the model needs.
Unbounded retries and fallbacks. Retry storms and fallback-to-larger-model logic that quietly escalate cost.
No unit accounting. Spend is tracked as a monthly total, so no one can attribute it to a feature or fix.

Review checklist

Can you compute cost per request / per user / per feature today?
What share of calls go to a frontier model that a smaller model could serve?
How large is your average prompt, and how much of it is static (cacheable)?
Is prompt caching enabled for stable system instructions?
Are repeated identical queries served from a cache?
Are you re-embedding documents that have not changed?
How many chunks do you retrieve, and does the model need them all?
What is your retry rate, and what does a retry cost?
Do you have a quality guardrail so a cost cut can’t silently degrade output?

Example findings

(Illustrative — from the pattern of real reviews, not a specific client.)

A summarization feature ran every call on a frontier model; a tier-down on the 70% of calls under a length threshold cut that feature’s spend materially with no measurable quality change on the evaluation set.
40% of a support assistant’s prompt was a static instruction block re-sent on every call; enabling prompt caching removed it from per-call cost.
A RAG pipeline re-embedded the entire corpus nightly though <3% of documents changed; switching to change-detection cut embedding spend sharply.

Actions to take

Instrument unit cost first. You cannot optimize what you cannot attribute. Log tokens and model per call, tagged by feature.
Right-size models by task with an evaluation set that guards quality before and after.
Cache the stable parts — system prompts and repeated queries.
Trim context — rank and cap retrieved chunks; cut prompt accretion.
Bound retries and fallbacks and measure what they cost.
Forecast with the per-request model so the next 10x in usage is a planned number, not a surprise.

Where this connects

If you own a database bill, none of this is foreign — it is the same discipline of measuring usage, finding structural waste, and sequencing fixes. The next article in this series, Why Database Engineers Should Care About AI Cost Engineering, makes that case directly.

Want an engineering-grade cost model for your AI workloads? AKS runs an AI Cost Engineering Advisory — read-only, evidence-driven, and focused on cuts that don’t degrade quality. Or start with the free 30-Point Database Cost Review Checklist, or see what a review delivers in the Acme SaaS sample report.

Problem

Why it matters financially

Technical root causes

Review checklist

Example findings

Actions to take

Where this connects

Rajiv

Related Posts

Build vs Buy: The AI Platform Architecture Decision

AI Governance for Engineering Teams: Preventing Shadow AI Spend Without Blocking Innovation

AI Token Cost Overruns: Why AI Coding Assistants Are Becoming the New Cloud Bill Problem