Your alerting channel just fired: the monthly OpenAI billing threshold was breached, and it is only the 12th of the month. You are burning $2,000 a day on unstructured completions, and engineering leadership needs an explanation and a mitigation plan by noon.

Situation

AI features are increasingly embedded into high-throughput critical paths — search ranking, customer support triage, real-time data extraction, autonomous coding pipelines. Unlike traditional compute where scaling costs are linear and predictable, LLM API costs are non-deterministic. A slightly misconfigured system prompt, an unconstrained user input field, or an infinite retry loop on malformed JSON can cause token consumption to spike geometrically overnight.

The operational challenge is that standard APM tools do not surface this. Latency looks normal. Error rate is zero. The API calls are succeeding; they are just silently processing millions of context tokens with no dashboard panel tracking them.

Symptoms

An AI cost incident typically presents through one or more of these signals:

  • Provider billing dashboard shows daily spend 2x–5x above the trailing 7-day average
  • Monthly budget threshold alert fires before mid-month
  • A specific feature’s token usage is growing faster than its request count — the context window is expanding
  • A single workflow session is consuming tokens at 10x its expected rate, which points to a retry loop
  • Spend is climbing but no specific feature, user, or deployment can be identified as the source — missing attribution
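
Two of the signals above, spend relative to the trailing 7-day average and tokens growing faster than requests, are simple ratios. A minimal sketch, assuming you can export daily spend and per-period token and request counts from your provider or gateway; the function names and sample numbers are hypothetical:

```python
from statistics import mean


def spend_spike_ratio(daily_spend: list[float]) -> float:
    """Latest day's spend divided by the mean of the 7 days before it."""
    if len(daily_spend) < 8:
        raise ValueError("need at least 8 days: 7 for the baseline plus today")
    baseline = mean(daily_spend[-8:-1])  # trailing 7-day average, excluding today
    if baseline == 0:
        raise ValueError("baseline spend is zero; ratio is undefined")
    return daily_spend[-1] / baseline


def tokens_per_request_growth(
    prev_tokens: int, prev_requests: int, cur_tokens: int, cur_requests: int
) -> float:
    """Ratio of current tokens-per-request to the previous period's value.

    A result well above 1.0 while request counts stay flat suggests the
    context window is expanding rather than traffic growing.
    """
    if prev_tokens <= 0 or prev_requests <= 0 or cur_requests <= 0:
        raise ValueError("token and request counts must be positive")
    return (cur_tokens / cur_requests) / (prev_tokens / prev_requests)


if __name__ == "__main__":
    # Hypothetical numbers: a steady ~$405/day baseline, then a $2,000 day.
    history = [400, 420, 390, 410, 405, 395, 415, 2000]
    print(f"spend ratio: {spend_spike_ratio(history):.2f}x")
    print(f"tokens/request growth: "
          f"{tokens_per_request_growth(1_000_000, 1000, 3_000_000, 1000):.1f}x")
```

A ratio in the 2x to 5x band matches the first signal in the list; the alert thresholds themselves are a judgment call for your workload.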

The absence of attribution is itself a diagnostic signal. If you cannot identify which key, feature, or deployment is responsible within five minutes of a spend alert, your observability is the first problem to fix.

First Five Checks

Run these within the first 10 minutes of an alert. No code changes yet — establish what you know before you act.

# 1. Check provider usage by day — identify when the spike started
# Anthropic: use the Usage page in the console (console.anthropic.com)
# OpenAI: platform.openai.com/usage

# 2. Break down by API key — which key is responsible
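# If using Helicone as gateway (the endpoint paths and query parameters in these
# examples are illustrative; confirm them against the current Helicone API reference):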
curl -H "Authorization: Bearer $HELICONE_API_KEY" \
  "https://www.helicone.ai/api/v1/request/stats?groupBy=apiKey" | jq .

# 3. Find the largest single requests in the last 24 hours
curl -H "Authorization: Bearer $HELICONE_API_KEY" \
  "https://www.helicone.ai/api/v1/request?sort=totalTokens&order=desc&limit=10" | jq .

# 4. Check for retry storms — failed requests being repeatedly retried
# (assumes the first field of each log line is the session or client identifier)
grep -E 'status=(429|500)' /var/log/ai-gateway/requests.log | \
  awk '{print $1}' | sort | uniq -c | sort -rn | head -20

# 5. Track prompt token count trend — is average prompt size growing?
curl -H "Authorization: Bearer $HELICONE_API_KEY" \
  "https://www.helicone.ai/api/v1/request/stats?groupBy=hour&metric=promptTokens" | jq .

If you do not have a proxy gateway, check the provider’s usage console directly. The major providers (Anthropic, OpenAI, Google) expose usage breakdowns in their consoles, though the granularity (per key, per project, or per workspace) varies. The goal is to identify the unit of attribution (key, feature, or deployment) before moving to mitigation.
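
If you only have raw request logs, the attribution step can be approximated offline. A minimal sketch, assuming each log record is a dict with `api_key`, `feature`, `prompt_tokens`, and `completion_tokens` fields; these field names are hypothetical and should be mapped to whatever your logs actually contain:

```python
from collections import defaultdict


def tokens_by_unit(records: list[dict], unit: str = "api_key") -> list[tuple[str, int]]:
    """Sum total tokens per attribution unit, largest consumer first.

    Records missing the chosen field are grouped under "unattributed",
    which makes gaps in tagging visible instead of silently dropping them.
    """
    totals: dict[str, int] = defaultdict(int)
    for record in records:
        key = record.get(unit) or "unattributed"
        totals[key] += record["prompt_tokens"] + record["completion_tokens"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample = [
        {"api_key": "key-a", "feature": "search", "prompt_tokens": 900, "completion_tokens": 100},
        {"api_key": "key-b", "feature": "triage", "prompt_tokens": 200, "completion_tokens": 50},
        {"api_key": "key-a", "prompt_tokens": 4000, "completion_tokens": 500},
    ]
    for name, total in tokens_by_unit(sample, unit="feature"):
        print(f"{name}: {total} tokens")
```

A large "unattributed" bucket is the same diagnostic signal described under Symptoms: tagging has to be fixed before the other checks are meaningful.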

Decision Tree

flowchart TD
    A[Spend Alert Fires] --> B{Can you attribute spend to a specific key or feature?}
    B -->|No| D[Enable request logging — tag all requests with feature and user ID]
    B -->|Yes| C{Is it a retry loop — same session consuming 10x expected tokens?}
    C -->|Yes| E[Disable retry logic — apply circuit breaker at gateway]
    C -->|No| F{Is prompt token count growing without request count growing?}
    F -->|Yes| G[Reduce max context — drop RAG chunk count or document length]
    F -->|No| H[Check for new deployment — compare prompt template to baseline]
    E --> I[Apply fix — redeploy with budget guard]
    G --> I
    H --> I
    D --> J[Wait 30 minutes — re-triage with attribution data]

The decision tree has one upstream blocker: if you cannot attribute spend to a feature or key, all downstream branches are unreachable. Fixing attribution is always the first remediation for an unattributed spike.
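
The same branching can be written as a small function, which is useful if you want an alert handler to suggest the first action. This is a sketch that mirrors the flowchart, not a complete triage system; the `Observation` fields and the returned action labels are hypothetical names, and the 10x threshold is taken from the retry-loop signal above:

```python
from dataclasses import dataclass


@dataclass
class Observation:
    attributed: bool              # can spend be tied to a key or feature?
    session_token_ratio: float    # worst session's tokens / expected tokens
    prompt_tokens_growing: bool   # is average prompt size trending up?
    request_count_growing: bool   # is request volume trending up?


def triage(obs: Observation) -> str:
    """Return the first remediation step suggested by the decision tree."""
    if not obs.attributed:
        # Upstream blocker: nothing else is reachable without attribution.
        return "enable_request_logging"
    if obs.session_token_ratio >= 10:
        return "apply_circuit_breaker"
    if obs.prompt_tokens_growing and not obs.request_count_growing:
        return "reduce_max_context"
    return "compare_prompt_template_to_baseline"


if __name__ == "__main__":
    print(triage(Observation(True, 1.2, True, False)))
```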

Remediation Options

Option 1 — Hard spend cap (immediate, reversible) Set a spending limit at the key, project, or organization level in the provider console, whichever granularity your provider offers. Anthropic and OpenAI both support monthly limits; confirm whether yours is enforced as a hard stop or only triggers a notification. A hard cap stops the bleeding immediately but may break features. Use this when the spike is severe and root cause is unknown.

Option 2 — Context size reduction (targeted, low disruption) If the spike is caused by context window expansion — RAG pipelines fetching larger documents, an upstream data source change injecting bloated records — reduce the maximum number of retrieved chunks or the max document length. Reduce top_k in your vector store from 10 to 3. Reduce max document length from 2000 tokens to 500. This is fully reversible.
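
To see why this lever is so effective, work through the upper bound on retrieved context using the numbers above. The sketch below assumes chunks are already tokenized into lists; `retrieved_token_budget` and `cap_chunks` are hypothetical helper names, and the budget covers retrieved context only, not the system prompt or user input:

```python
def retrieved_token_budget(top_k: int, max_doc_tokens: int) -> int:
    """Worst-case retrieved-context tokens per request."""
    return top_k * max_doc_tokens


def cap_chunks(chunks: list[list[str]], top_k: int, max_doc_tokens: int) -> list[list[str]]:
    """Keep the first top_k chunks and truncate each to max_doc_tokens tokens.

    Assumes chunks arrive ranked by relevance, most relevant first.
    """
    return [chunk[:max_doc_tokens] for chunk in chunks[:top_k]]


if __name__ == "__main__":
    before = retrieved_token_budget(top_k=10, max_doc_tokens=2000)  # 20000
    after = retrieved_token_budget(top_k=3, max_doc_tokens=500)     # 1500
    reduction = 1 - after / before
    print(f"worst case per request: {before} -> {after} tokens "
          f"({reduction:.1%} reduction)")
```

Going from 10 chunks of 2000 tokens to 3 chunks of 500 lowers the worst-case retrieved context from 20,000 to 1,500 tokens per request. Actual savings depend on how often requests were hitting the old ceiling.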

Option 3 — Circuit breaker (targeted, moderate disruption) If the spike is caused by a retry loop — an agent repeatedly retrying on malformed JSON, a webhook re-processing the same event — apply a circuit breaker at the API gateway layer. After N failed attempts per session, return a cached or degraded response without hitting the provider.
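
A per-session breaker can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not a production gateway policy: the class and function names are hypothetical, failures are counted consecutively per session, and the caller is expected to raise an exception from `call_provider` when the response is unusable (for example, when JSON parsing fails):

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class SessionCircuitBreaker:
    """Opens after max_failures consecutive failures in one session."""

    def __init__(self, max_failures: int = 3) -> None:
        self.max_failures = max_failures
        self._failures: dict[str, int] = {}

    def is_open(self, session_id: str) -> bool:
        return self._failures.get(session_id, 0) >= self.max_failures

    def record_failure(self, session_id: str) -> None:
        self._failures[session_id] = self._failures.get(session_id, 0) + 1

    def record_success(self, session_id: str) -> None:
        # A success resets the count, so only consecutive failures trip it.
        self._failures.pop(session_id, None)


def call_with_breaker(
    breaker: SessionCircuitBreaker,
    session_id: str,
    call_provider: Callable[[], T],
    fallback: T,
) -> T:
    """Call the provider unless the breaker is open; return fallback on failure."""
    if breaker.is_open(session_id):
        return fallback  # no provider call, so no tokens are spent
    try:
        result = call_provider()
    except Exception:
        breaker.record_failure(session_id)
        return fallback
    breaker.record_success(session_id)
    return result
```

In a real deployment the failure counts would live in shared storage at the gateway rather than in process memory, and open breakers would usually expire after a cooldown.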

Option 4 — Model tier downgrade (immediate, quality tradeoff) If attribution shows a single feature is consuming disproportionate spend, route that feature to a smaller model temporarily. This provides immediate cost relief but degrades output quality. Test with a small percentage of traffic before full rollout.
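
One way to test a downgrade on a small share of traffic is deterministic bucketing on a request identifier. The sketch below is an assumption about how such routing might look; the model names `large-model` and `small-model`, the function name, and the parameters are all placeholders rather than any provider's or gateway's API:

```python
import hashlib

PRIMARY_MODEL = "large-model"   # placeholder name
FALLBACK_MODEL = "small-model"  # placeholder name


def pick_model(
    request_id: str,
    feature: str,
    downgraded_features: set[str],
    rollout_percent: int,
) -> str:
    """Route rollout_percent of a downgraded feature's requests to the smaller model.

    The bucket is derived from a hash of request_id, so the same request
    always lands in the same bucket across retries and processes.
    """
    if feature not in downgraded_features:
        return PRIMARY_MODEL
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # integer in 0..99
    return FALLBACK_MODEL if bucket < rollout_percent else PRIMARY_MODEL


if __name__ == "__main__":
    sample = [pick_model(f"req-{i}", "search", {"search"}, 10) for i in range(1000)]
    print(f"{sample.count(FALLBACK_MODEL)} of 1000 requests routed to {FALLBACK_MODEL}")
```

Raising `rollout_percent` in steps lets you compare quality and cost between the two groups before committing the whole feature to the smaller model.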

The pattern reflected in tools such as Cloudflare AI Gateway and the Vercel AI SDK is that these levers should be pre-built and deployable in minutes, not improvised during an incident. Rate limiting rules, fallback model routes, and context size caps are standing configuration, not incident response code.

Rollback Plan

If a remediation makes things worse (a feature breaks, or quality degrades unacceptably), roll back in this order:

  1. Revert the most recent AI-related deployment: Check git log for any prompt template, model version, or RAG configuration changes in the past 48 hours. A single system prompt change is the most common source of context window expansion.
  2. Re-enable the previous API key: If you rotated keys during triage, the old key is the rollback path. Ensure the new key is revoked at the provider, not just removed from your configuration.
  3. Restore context limits incrementally: If you reduced context and the feature is returning degraded results, restore in steps (500 → 1000 → 2000 tokens) and measure cost and quality at each step.
  4. Restore the original model tier: If you downgraded model routing, restore the original. Document the quality delta before and after for the post-incident review.

Do not roll back to the pre-incident state without understanding root cause. You will reproduce the same spike within days.

Automation Opportunity

These checks should not require manual intervention during an incident. Each can be built once and deployed as standing infrastructure:

| Manual step today | Automated with | Estimated effort |
|---|---|---|
| Per-key spend breakdown | Helicone or LiteLLM proxy with Grafana panel | Low (hours) |
| Budget threshold alerting | Provider billing alerts wired to PagerDuty or Slack | Low (hours) |
| Automatic circuit breaker on retry storm | API gateway rate-limit policy by session ID | Low (hours) |
| Feature-level attribution headers | Middleware that injects X-Feature-ID on every outbound request | Medium (days) |
| Context window size trending | Custom metric from gateway request logs | Medium (days) |
| Automated model downgrade on budget threshold | LiteLLM fallback routing rule triggered by spend rate | Medium (days) |

Vercel’s AI SDK returns token usage with each call result, which lets you attribute spend to specific routes without a proxy gateway. Cloudflare AI Gateway provides rate limiting and caching as gateway configuration. Both depend less on bespoke incident code than on deployment and configuration decisions, and those decisions are easiest to make before the first incident.
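
The attribution middleware itself can be very small. A minimal sketch of the header-injection step, assuming your gateway or logging layer records custom request headers; `X-Feature-ID` follows the convention used in this runbook, while `X-User-ID` and the function name are hypothetical:

```python
from typing import Optional


def with_attribution(
    headers: dict[str, str], feature_id: str, user_id: Optional[str] = None
) -> dict[str, str]:
    """Return a copy of headers with attribution tags added.

    Raises if feature_id is empty so that untagged requests fail loudly
    in development instead of showing up as unattributed spend later.
    """
    if not feature_id:
        raise ValueError("feature_id is required for every outbound LLM request")
    tagged = dict(headers)  # do not mutate the caller's dict
    tagged["X-Feature-ID"] = feature_id
    if user_id:
        tagged["X-User-ID"] = user_id
    return tagged


if __name__ == "__main__":
    base = {"Content-Type": "application/json"}
    print(with_attribution(base, feature_id="support-triage", user_id="u-123"))
```

Whether the provider or gateway preserves these headers in its logs depends on the product, so verify that the tags actually appear in your request records before relying on them during an incident.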

Leadership Summary

When leadership needs the update by noon, they need three things: what happened, what stopped it, and what will prevent recurrence.

Template:

We detected an anomalous spike in LLM API spend starting [DATE] caused by [CAUSE — context window growth / retry loop / new feature deployment / misrouted traffic]. We contained it by [ACTION — applying a spend cap / reducing context size / adding a circuit breaker]. Current daily spend is back to $[X]. Root cause was [ONE SENTENCE]. To prevent recurrence, we are [SPECIFIC CHANGE — adding attribution headers / deploying rate limit policy / implementing context size caps]. Expected completion: [DATE].

If you cannot fill in every blank in that template, you have not finished the first five checks. An incident summary that says “we are investigating” is not a summary — it is a status update that confirms leadership has no visibility into their AI spend.

What to Do Next

  • Problem: LLM API spend is non-deterministic and standard APM tools do not surface context window growth or retry storms until the billing alarm fires.
  • Solution: Deploy an API proxy gateway with per-request attribution headers, set hard monthly spend limits at the provider level, and implement circuit breakers on retry patterns before the first incident.
  • Proof: Cloudflare AI Gateway and Vercel AI SDK provide the attribution and rate-limiting primitives described in this runbook — both are documented, deployed configuration, not custom code.
  • Action: Audit whether your current AI workloads have per-request attribution headers and a hard monthly spend cap configured at the provider. If either is missing, those are the two changes to make this week.